LABORATORY  FOR 
COMPUTER  SCIENCE 


MASSACHUSETTS 
INSTITUTE  OF 
TECHNOLOGY 


MIT/LCS/TM-453


COST-SENSITIVE  ANALYSIS 
OF  COMMUNICATION 
PROTOCOLS 


Baruch  Awerbuch 
Alan  Baratz 
David  Peleg 


DISTRIBUTION STATEMENT A

Approved for public release; distribution unlimited


June  1991 


545  TECHNOLOGY  SQUARE,  CAMBRIDGE,  MASSACHUSETTS  02139 


91-03524 


REPORT  DOCUMENTATION  PAGE 


1a. REPORT SECURITY CLASSIFICATION: Unclassified
1b. RESTRICTIVE MARKINGS:
2a. SECURITY CLASSIFICATION AUTHORITY:
2b. DECLASSIFICATION/DOWNGRADING SCHEDULE:
3. DISTRIBUTION/AVAILABILITY OF REPORT: Approved for public release; distribution is unlimited.
4. PERFORMING ORGANIZATION REPORT NUMBER(S): MIT/LCS/TM-453
5. MONITORING ORGANIZATION REPORT NUMBER(S): N00014-89-J-1988
6a. NAME OF PERFORMING ORGANIZATION: MIT Lab for Computer Science
6b. OFFICE SYMBOL (If applicable):
6c. ADDRESS (City, State, and ZIP Code): 545 Technology Square, Cambridge, MA 02139
7a. NAME OF MONITORING ORGANIZATION: Office of Naval Research / Dept. of Navy
7b. ADDRESS (City, State, and ZIP Code): Information Systems Program, Arlington, VA 22217
8a. NAME OF FUNDING/SPONSORING ORGANIZATION: DARPA/DOD
8b. OFFICE SYMBOL (If applicable):
8c. ADDRESS (City, State, and ZIP Code):
9. PROCUREMENT INSTRUMENT IDENTIFICATION NUMBER:
10. SOURCE OF FUNDING NUMBERS (Program Element No., Project No., Task No., Work Unit Accession No.):
11. TITLE (Include Security Classification): Cost-Sensitive Analysis of Communication Protocols
12. PERSONAL AUTHOR(S): Baruch Awerbuch, Alan Baratz, David Peleg
13a. TYPE OF REPORT: Technical
13b. TIME COVERED:
14. DATE OF REPORT (Year, Month, Day): June 1991
15. PAGE COUNT: 34
17. COSATI CODES (Field, Group, Sub-Group):
18. SUBJECT TERMS (Continue on reverse if necessary and identify by block number):
19. ABSTRACT (Continue on reverse if necessary and identify by block number):

This paper introduces the notion of cost-sensitive communication complexity and
exemplifies it on the following basic communication problems: computing a global
function, network synchronization, clock synchronization, controlling protocols' worst-case
execution, connected components, spanning tree, etc., constructing a minimum
spanning tree, constructing a shortest path tree.

20. DISTRIBUTION/AVAILABILITY OF ABSTRACT: (Unclassified/Unlimited; Same as Rpt.; DTIC Users)
21. ABSTRACT SECURITY CLASSIFICATION: Unclassified
22a. NAME OF RESPONSIBLE INDIVIDUAL: Carol Nicolora
22b. TELEPHONE (Include Area Code):
22c. OFFICE SYMBOL:

DD FORM 1473, 84 MAR. 83 APR edition may be used until exhausted. All other editions are obsolete.
SECURITY CLASSIFICATION OF THIS PAGE: Unclassified

Cost-Sensitive  Analysis  of  Communication  Protocols 

Baruch Awerbuch*    Alan Baratz†    David Peleg‡

June 13, 1991


Abstract


This paper introduces the notion of cost-sensitive communication complexity and
exemplifies it on the following basic communication problems: computing a global
function, network synchronization, clock synchronization, controlling protocols' worst-case
execution, connected components, spanning tree, etc., constructing a minimum
spanning tree, constructing a shortest path tree.


*Department of Mathematics and Laboratory for Computer Science, Cambridge, MA 02139.
ARPA: baruch@theory.lcs.mit.edu. Supported by Air Force Contract TNDGAFOSR-86-0078, ARO contract
DAAL03-86-K-0171, NSF contract CCR8611442, DARPA contract N00014-89-J-1988, and a special grant
from IBM. Part of the work was done while visiting IBM T.J. Watson Research Center.

†IBM T.J. Watson Research Center, Yorktown Heights, NY 10598.

‡Department of Applied Mathematics and Computer Science, The Weizmann Institute, Rehovot 76100,
Israel. BITNET: peleg@wisdom. Supported in part by an Allon Fellowship, by a Bantrell Fellowship and by
a Haas Career Development Award. Part of the work was done while visiting MIT and IBM T.J. Watson
Research Center.


1  Introduction


1.1  Motivation 

Traffic  load  is  one  of  the  major  factors  affecting  the  behavior  of  a  communication  network. 
This  fact  is  well  recognized,  and  is  the  reason  why  most  models  for  communication  networks 
and  most  algorithms  for  routing,  traffic  analysis  etc.  model  the  network  using  a  weight 
function  on  the  edges,  capturing  this  factor.  In  this  model,  the  weight  of  an  edge  reflects 
the  estimated  delay  for  a  message  transmitted  on  this  edge,  and  thus  also  the  cost  for 
using  this  edge.  The  significance  of  the  load  factor  has  also  motivated  the  intense  study  of 
efficient  methods  for  performing  basic  network  tasks  such  as  computing  shortest  paths  and 
constructing  minimum  weight  spanning  trees  (with  length  /  weight  defined  with  respect  to 
the  weight  function). 

However,  in  most  of  the  previous  work  on  distributed  algorithms  for  these  and  other 
tasks,  the  design  and  analysis  of  the  algorithms  themselves  completely  disregards  this  weight 
function.  That  is,  transmission  over  all  the  edges  is  assumed  to  be  equally  costly  and 
completed  within  the  same  time  bound.  Such  assumptions  are  made  even  when  the  task 
performed  by  the  algorithm  is  directly  related  to  the  edge  costs,  and  the  algorithm  has  to  be 
executed  over  the  same  network,  and  thus  suffer  the  same  delays.  This  seems  to  contradict 
the  very  purpose  towards  which  the  tasks  are  performed.  It  is  sometimes  argued  that  it  is  not 
crucial  to  take  the  weights  into  account  when  considering  such  “network  service”  algorithms, 
since  these  algorithms  occupy  only  a  thin  slice  of  the  network’s  bandwidth.  Nonetheless,  it 
is  clear  that  an  algorithm  that  can  do  well  in  that  respect  is  preferable  to  one  that  ignores 
the  issue. 

This paper proposes an approach enabling us to take traffic loads into account in the
design of distributed algorithms. This issue is addressed by introducing cost-sensitive
complexity measures for the analysis of distributed protocols. We consider weighted analogs of both
communication and time complexity. We then examine a host of basic network problems,
such as connectivity, computing global functions, network synchronization, controlling the
worst-case execution of protocols, and constructing minimum spanning trees and shortest
path trees. For each of these problems we seek to establish lower bounds and propose
efficient algorithms with respect to the new complexity measures.

We  feel  that  the  approach  proposed  in  this  paper  may  serve  as  a  basis  for  a  more  accurate 
account  of  the  behavior  of  distributed  algorithms  in  communication  networks. 


1.2  The  model 


We consider the standard model of (static) asynchronous communication networks. The network
is modeled as a communication graph $G = (V, E, w)$, where a weight $w(e)$ is associated with each
(undirected) edge $e$ of the network. We denote $n = |V|$ and $m = |E|$. We also denote by $W$ the
maximal weight of a network edge, $W = \max_{(u,v) \in E} w(u,v)$. We make the assumption
that $W = poly(n)$, and thus $\log W = O(\log n)$. For any subgraph $G' = (V', E', w)$ of $G$,
let $w(G')$ denote the total weight of $G'$, i.e., $w(G') = \sum_{e \in E'} w(e)$. Let $dist(u,v,G')$ be the
weighted distance from $u$ to $v$ in $G'$, i.e., the minimum of $w(p)$ over all paths $p$ from $u$ to
$v$ in $G'$, and let $Path(u,v,G')$ denote some arbitrary path achieving this minimum. Let
$Diam(G')$ denote the diameter of $G'$, i.e., $\max_{u,v \in V'} dist(u,v,G')$. Given a tree $T$ and two
vertices $x, y$ in it, denote by $Path(x,y,T)$ the path from $x$ to $y$ in $T$.
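
The quantities defined in this paragraph can be computed directly on a small instance. The following Python sketch is only an illustration under assumptions that are not part of the paper: the graph is given as a list of `(u, v, weight)` triples, the helper names (`build_adj`, `dijkstra`, `path`, `diam`) are hypothetical, and Dijkstra's algorithm is used as one standard way to obtain the weighted distances.

```python
import heapq

def build_adj(edges):
    """Adjacency map of an undirected weighted graph given as (u, v, w) triples."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj

def dijkstra(adj, src):
    """Return (dist, parent): weighted distances from src and shortest-path parents."""
    dist, parent = {src: 0}, {src: None}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale queue entry
        for v, w in adj[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v], parent[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    return dist, parent

def path(adj, u, v):
    """One path achieving the weighted distance from u to v (v must be reachable)."""
    _, parent = dijkstra(adj, u)
    out = [v]
    while out[-1] != u:
        out.append(parent[out[-1]])
    return out[::-1]

def diam(adj):
    """Weighted diameter: the largest weighted distance between two vertices."""
    return max(max(dijkstra(adj, u)[0].values()) for u in adj)

# Hypothetical 4-vertex example graph.
EDGES = [("a", "b", 1), ("b", "c", 2), ("a", "c", 5), ("c", "d", 1)]
ADJ = build_adj(EDGES)
W = max(w for _, _, w in EDGES)      # maximal edge weight
TOTAL = sum(w for _, _, w in EDGES)  # total weight of the graph
```

In this example the direct edge from `a` to `c` has weight 5, but the weighted distance is 3 (through `b`), which is exactly the distinction between an edge weight and a weighted distance used throughout the paper.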

Next let us define some basic graph notation. For a vertex $v \in V$, let

$$Rad(v, G) = \max_{w \in V} dist(v, w, G).$$

Given a set of vertices $S \subseteq V$, let $G(S)$ denote the subgraph induced by $S$ in $G$. A cluster
is a subset of vertices $S \subseteq V$ such that $G(S)$ is connected. The radius of a cluster $S$ is
denoted $Rad(S) = \min_{v \in S} Rad(v, G(S))$. A cover is a collection of clusters $\mathcal{S} = \{S_1, \ldots, S_m\}$
such that $\bigcup_i S_i = V$. Given a collection of clusters $\mathcal{S}$, let $Rad(\mathcal{S}) = \max_i Rad(S_i)$. For
every vertex $v \in V$, let $deg_{\mathcal{S}}(v)$ denote the degree of $v$ in the hypergraph $(V, \mathcal{S})$, i.e., the
number of occurrences of $v$ in clusters $S \in \mathcal{S}$. The maximum degree of a cover $\mathcal{S}$ is defined
as $\Delta(\mathcal{S}) = \max_{v \in V} deg_{\mathcal{S}}(v)$.

Given two covers $\mathcal{S} = \{S_1, \ldots, S_m\}$ and $\mathcal{T} = \{T_1, \ldots, T_k\}$, we say that $\mathcal{T}$ subsumes $\mathcal{S}$ if
for every $S_i \in \mathcal{S}$ there exists a $T_j \in \mathcal{T}$ such that $S_i \subseteq T_j$.
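
To make the cover notation concrete, the following Python sketch evaluates the radius, the maximum degree and the "subsumes" relation on a toy instance. It is an illustration only: the edge-list representation, the function names and the example covers are assumptions introduced here, and distances inside a cluster are computed by simple repeated relaxation rather than by any method from the paper.

```python
def induced_dist(edges, cluster, src):
    """Weighted distances from src inside the subgraph induced by the cluster."""
    dist = {v: float("inf") for v in cluster}
    dist[src] = 0
    for _ in range(len(cluster) - 1):          # Bellman-Ford style relaxation rounds
        for u, v, w in edges:
            if u in cluster and v in cluster:  # keep only induced edges
                dist[v] = min(dist[v], dist[u] + w)
                dist[u] = min(dist[u], dist[v] + w)
    return dist

def cluster_radius(edges, cluster):
    """Radius of a cluster: the smallest eccentricity of a vertex inside it."""
    return min(max(induced_dist(edges, cluster, v).values()) for v in cluster)

def cover_radius(edges, cover):
    """Radius of a cover: the largest radius among its clusters."""
    return max(cluster_radius(edges, s) for s in cover)

def max_degree(vertices, cover):
    """Maximum number of clusters of the cover containing a single vertex."""
    return max(sum(v in s for s in cover) for v in vertices)

def subsumes(t_cover, s_cover):
    """True if every cluster of s_cover is contained in some cluster of t_cover."""
    return all(any(s <= t for t in t_cover) for s in s_cover)

# Hypothetical example: a 4-vertex graph and two covers of its vertex set.
EDGES = [("a", "b", 1), ("b", "c", 2), ("a", "c", 5), ("c", "d", 1)]
S = [{"a", "b"}, {"b", "c"}, {"c", "d"}]
T = [{"a", "b", "c"}, {"c", "d"}]
```

Here `T` subsumes `S` but not the other way around, and merging the first two clusters of `S` does not increase the cover radius in this particular instance.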

In the sequel we make use of the following theorem of [AP91].

Theorem 1.1 [AP91] Given a graph $G = (V,E)$, $|V| = n$, an initial cover $\mathcal{S}$ and an integer
$k \ge 1$, it is possible to construct a cover $\mathcal{T}$ that satisfies the following properties:

(1) $\mathcal{T}$ subsumes $\mathcal{S}$,

(2) $Rad(\mathcal{T}) \le (2k - 1) Rad(\mathcal{S})$, and

(3) $\Delta(\mathcal{T}) = O(k |\mathcal{S}|^{1/k})$.


1.3  The  complexity  measures 


This paper introduces weighted complexity measures analogous to the traditional time and
communication measures. We define the cost of transmitting a message over an edge $e$
as $w(e)$. The communication complexity of a protocol $\pi$, denoted $c_\pi$, is the sum of all
transmission costs of all messages sent during the execution of $\pi$. The time complexity of the
protocol $\pi$, denoted $t_\pi$, is the maximal physical time it takes $\pi$ to complete its execution,
assuming that the delay on an edge $e$ varies between 0 and $w(e)$. The classical complexity
measures correspond to the case where $w(e) = 1$ for all $e \in E$.

Traditionally, communication protocols are evaluated in terms of $E$, $V$, $D$, which denote,
respectively, the number of edges, the number of vertices, and the unweighted (hop based)
diameter of the network. It turns out that it is convenient to evaluate the weighted
complexity of protocols using the “weighted analogs” of $E$, $V$, $D$, denoted by $\mathcal{E}$, $\mathcal{V}$, $\mathcal{D}$, which are
defined as follows:

$$\mathcal{E} = w(G) \quad \left( = \sum_{e \in E} w(e) \right)$$

$$\mathcal{V} = w(T) \quad \text{where $T$ is an MST of $G$}$$

$$\mathcal{D} = Diam(G)$$

The analogy between these parameters and their unweighted counterparts is manifested
in the fact that $\mathcal{E}$ equals the total cost of transmitting a single message over all the edges of
the network, $\mathcal{V}$ is the minimal cost of reaching (or disseminating a message to) all vertices,
and $\mathcal{D}$ is the maximal cost of transmitting a message between a pair of network nodes.

In the sequel we express the complexity of our algorithms in terms of $\mathcal{E}$, $\mathcal{V}$ and $\mathcal{D}$. This
gives results that are conveniently similar in appearance to the results of the unweighted
case, as follows from the statement of results in the following subsection.
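
As a quick sanity check on the three weighted parameters (total edge weight, MST weight, weighted diameter), the following Python sketch computes them for a small graph and for the same graph with unit weights. It is a sketch under assumptions: the function names and the example graph are hypothetical, Kruskal's algorithm with a union-find structure is one standard way to obtain the MST weight, and Floyd-Warshall is used for the diameter purely for brevity.

```python
def mst_weight(vertices, edges):
    """Weight of a minimum spanning tree (Kruskal with union-find)."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    total = 0
    for u, v, w in sorted(edges, key=lambda e: e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:          # edge joins two components: keep it
            parent[ru] = rv
            total += w
    return total

def weighted_diameter(vertices, edges):
    """Largest weighted distance between two vertices (Floyd-Warshall)."""
    inf = float("inf")
    d = {(u, v): (0 if u == v else inf) for u in vertices for v in vertices}
    for u, v, w in edges:
        d[u, v] = d[v, u] = min(d[u, v], w)
    for k in vertices:
        for i in vertices:
            for j in vertices:
                if d[i, k] + d[k, j] < d[i, j]:
                    d[i, j] = d[i, k] + d[k, j]
    return max(d.values())

def weighted_parameters(vertices, edges):
    """Return (total edge weight, MST weight, weighted diameter)."""
    total = sum(w for _, _, w in edges)
    return total, mst_weight(vertices, edges), weighted_diameter(vertices, edges)

# Hypothetical example graph, and the same graph with all weights set to 1.
VERTICES = ["a", "b", "c", "d"]
EDGES = [("a", "b", 1), ("b", "c", 2), ("a", "c", 5), ("c", "d", 1)]
UNIT_EDGES = [(u, v, 1) for u, v, _ in EDGES]
```

With unit weights the three values collapse to the number of edges, n - 1, and the hop diameter, which matches the remark that the classical measures are the special case w(e) = 1.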


1.4  Problems  and  results 

1.4.1  Global  function  computation 

The problem: We are concerned here with computing global functions in a network. We
assume that the structure of the network is known to all the vertices (including the edge
weights). The only unknowns are the values of the $n$ arguments of the function, which are
initially stored at different vertices of the network, one at each vertex. The outputs must
be produced at all the vertices.

Global function computation

                 Communication          Time
Upper bound      $O(\mathcal{V})$       $O(\mathcal{D})$
Lower bound      $\Omega(\mathcal{V})$  $\Omega(\mathcal{D})$

Figure 1: Lower and upper bounds for global function computation.

We restrict ourselves to the family of functions called symmetric compact in [GS86].
The functions $f_n : X^n \to X$ in this family are symmetric (i.e., any two arguments can
be switched) and compact, in the sense that the contribution of any subset of arguments
can be represented in “compact form” by a string of size $\log_2 |X|$. The latter condition is
formalized by assuming that there exists a function $g : X^2 \to X$ such that for any $k < n$,

$$f_n(x_1, x_2, \ldots, x_n) = g(f_k(x_1, x_2, \ldots, x_k), f_{n-k}(x_{k+1}, \ldots, x_n)).$$

Computing such functions is quite a basic task in the area of network protocols. Many
functions belong to this family, e.g., maximum, sum, and basic boolean functions (XOR, AND,
OR). Many other tasks, e.g., broadcasting a message from a given node to the rest of the
network, termination detection, global synchronization, etc., can be represented as computing
a symmetric compact function. A similar class of functions is considered in [ALSY88].
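
The defining property of symmetric compact functions is what allows partial results to be combined on the way up a spanning tree. The Python sketch below illustrates this with a convergecast on a rooted tree; it is a centralized simulation under assumptions introduced here (the tree is given by a hypothetical `children` map and a `weight` map on parent-child pairs), not a protocol from the paper. Each vertex forwards a single compact summary to its parent, so the cost of the upward phase is the total weight of the tree.

```python
import operator

def convergecast(children, weight, values, g, root):
    """Fold the arguments up a rooted tree with g; return (result, cost).

    Each non-root vertex sends one message (its subtree's partial result)
    to its parent, so the cost is the total weight of the tree edges.
    """
    cost = 0

    def partial(v):
        nonlocal cost
        acc = values[v]
        for c in children.get(v, ()):
            acc = g(acc, partial(c))  # combine the compact summary of c's subtree
            cost += weight[v, c]      # one message over the edge between c and v
        return acc

    return partial(root), cost

# Hypothetical rooted tree: r has children a and b, and a has child c.
CHILDREN = {"r": ["a", "b"], "a": ["c"]}
WEIGHT = {("r", "a"): 2, ("r", "b"): 3, ("a", "c"): 1}
VALUES = {"r": 4, "a": 7, "b": 1, "c": 9}
```

Calling `convergecast(CHILDREN, WEIGHT, VALUES, max, "r")` computes the maximum, and replacing `max` by `operator.add` or `operator.xor` computes the sum or XOR at the same cost, since only the combining function `g` changes.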

The results: We show that the computation of global functions requires $\Theta(\mathcal{V})$ messages
and $\Theta(\mathcal{D})$ time.

The upper bound is derived as follows. Define a spanning tree to be a shallow-light tree (SLT)
if its diameter is $O(\mathcal{D})$ and its weight is $O(\mathcal{V})$. We then show that SLTs are effectively
constructible, which implies that computing the value of our global function can be performed
(optimally) with $O(\mathcal{V})$ messages and $O(\mathcal{D})$ time.

We are also concerned with efficient distributed constructions of SLTs, or, in short,
SLT algorithms. We present a specific SLT algorithm that requires $O(\mathcal{V} \cdot n^2)$ communication
and $O(\mathcal{D} \cdot n^2)$ time.

1.4.2  Clock  Synchronization 

Problem:  The  purpose  of  the  clock  synchronization  is  to  generate  at  each  node  a  sequence 
of  pulses,  such  that  pulse  p  at  a  node  is  generated  after  (in  the  “causal”  sense  [Lam78])  all 
neighbors  generate  pulse  p  —  1. 

As argued by Even and Rajsbaum [ER90], the relevant complexity measure here is the
“pulse delay”, which is the maximal time delay between two successive pulses at a node.
Let us denote $d = \max_{(u,v) \in E} dist(u, v)$, i.e., $d$ is the largest distance between neighbors in the
network. Clearly $d \le W$, and the problem is interesting when $d \ll W$. A lower bound of
$\Omega(d)$ and an upper bound of $O(W)$ are derived in [ER90]. (It is worth pointing out that
the main emphasis of [ER90] is on a somewhat different “directed” version of this problem.)

Results: In this paper, we show that one can achieve a pulse delay of $O(d \cdot \log^2 n)$, i.e., leave
a gap of $\log^2 n$ between the lower and upper bounds. This result relies heavily on a number
of existing techniques, like the “Network Partition” of [AP91], and the “Synchronizer $\gamma$” of
[Awe85a].

1.4.3  Network  Synchronization 

The problem: Asynchronous algorithms are in many cases substantially inferior in terms
of their complexity to corresponding synchronous algorithms, and their design and analysis
are more complicated. This motivates the development of a general simulation technique,
known as the synchronizer, that allows users to write their algorithms as if they are run
in a synchronous network. Implicitly, such techniques were already proposed in [Jaf80] and
[Gal82]. The first explicit statement of the problem was given in [Awe85a], and better
constructions for various cases were given in [PU89, AP90a]. Our goal is to extend the
concept of the synchronizer to the weighted case and provide an appropriate construction.

On a conceptual level, the synchronizer (as well as the controller, described in following
sections) is a protocol transformer, transforming a protocol $\pi$ into a protocol $\phi$ that is
equivalent to $\pi$ in some sense but enjoys some additional desirable properties. Recall that $c_\pi$
and $t_\pi$ denote the communication and time complexity of the protocol $\pi$, and similarly for
$c_\phi$ and $t_\phi$. Our purpose is to guarantee that the transformation keeps $c_\phi$ and $t_\phi$ small
compared to $c_\pi$ and $t_\pi$.

The synchronizer can be viewed as a way to remove variations from link delays in an
asynchronous network. In the “unweighted” case, this means that we want to “force” all
link delays to be exactly 1. In the “weighted” case, the most natural and most useful
generalization of this concept is to force the delay on each link $e$ to be exactly $w(e)$. In a sense,
the synchronizer enables one to simulate a “weighted” synchronous network $G(V, E, w)$, with each
link $e$ having a delay of exactly $w(e)$, by a “weighted” asynchronous network $G(V, E, w)$. Such
simulations may be useful for various applications for which the absence of variations in edge
delays significantly simplifies the task at hand, e.g., shortest paths [Awe89], constructing
routing tables [ABLP89], and others. However, in addition to simplifying protocol design and
analysis, synchronizers actually lead to complexity improvements for concrete algorithms.
For example, the algorithm $SPT_{synch}$ derived via a synchronizer (presented in Subsection
9.1) is the best known shortest path algorithm for certain values of $\mathcal{V}$, $\mathcal{D}$, $\mathcal{E}$.

We define the amortized costs of a synchronizer (i.e., the overhead per pulse) in
communication and in time, and denote them by $C_p$ and $T_p$, respectively.

At first sight, the clock synchronization problem from Subsection 1.4.2 seems to resemble
the problem of simulating an “unweighted” synchronous network $G'(V, E)$ (with all link
delays being exactly 1) by a “weighted” asynchronous network $G(V, E, w)$. The main difference
is in the fact that the only goal of the network synchronizer is to simulate a particular
protocol, whereas the purpose of the clock synchronizer is to generate pulses. In general, it
would be ineffective to use clock synchronizers for network synchronization, and vice versa.
Even though the methods that we use to handle both problems have certain techniques in
common, the differences are quite substantial.

The results: We construct a synchronizer $\gamma_w$, which is an analog of synchronizer $\gamma$ of
[Awe85a], such that for any fixed parameter $k$,

$$C_p(\gamma_w) = O(kn \cdot \log n)$$

$$T_p(\gamma_w) = O(\log_k n \cdot \log n)$$

1.4.4  Controllers 

Problem: The controller [AAPS87] is a protocol transformer transforming a protocol $\pi$
into a protocol $\phi$ that is equivalent to $\pi$ in terms of its input-output relation on a static
network, but is more “robust” than $\pi$ in the sense that it has “reasonable” complexity even
if it operates on “wrong” data.

Results: In the unweighted case, [AAPS87] presents a controller guaranteeing $c_\phi =
O(c_\pi \cdot \log^2 c_\pi)$. We show that the same bounds hold for the weighted case as well.

1.4.5  Connected  components,  spanning  tree 

The problems: The problems considered here are finding connected components and
constructing a (not necessarily minimum) spanning tree [Seg83, AGPV89]. These problems are
equivalent to each other.

Connectivity

Algorithm       Communication                                 Time
DFS             $O(\mathcal{E})$                              $O(\mathcal{E})$
$CON_{flood}$   $O(\mathcal{E})$                              $O(\mathcal{D})$
$CON_{hybrid}$  $O(\min\{\mathcal{E}, n \cdot \mathcal{V}\})$      $O(\min\{\mathcal{E}, n \cdot \mathcal{V}\})$
Lower bound     $\Omega(\min\{\mathcal{E}, n \cdot \mathcal{V}\})$ $\Omega(\mathcal{D})$

Figure 2: Our Connectivity algorithms.

The results: We show that performing any of the above tasks requires $\Theta(\min\{\mathcal{E}, n \cdot \mathcal{V}\})$
communication by providing matching upper and lower bounds. To be more precise, we
prove that

1. For every distributed connectivity algorithm $A$ and for any $n$ there exists a family of
$n$-vertex graphs $G$ on which $A$ requires communication complexity $\Omega(n \cdot \mathcal{V})$ and a family
of $n$-vertex graphs $G$ on which $A$ requires communication complexity $\Omega(\mathcal{E})$.

2. There is a distributed connectivity algorithm with communication complexity
$O(\min\{\mathcal{E}, n \cdot \mathcal{V}\})$ on any graph $G$.

1.4.6  Constructing  minimum  spanning  trees 

Problem: The minimum spanning tree (MST) of the graph $G$ is a tree of minimum weight
spanning $G$.

Results: We develop a number of MST algorithms, based on modifications of the
algorithms of [GHS83, Awe87].

1. An algorithm with communication complexity $O(\min\{\mathcal{E} + \mathcal{V} \cdot \log n, n \cdot \mathcal{V}\})$.

2. An algorithm with communication complexity $O(\mathcal{E} \cdot \log n \log \mathcal{V})$ and time complexity
$O(\mathcal{D} \cdot n \cdot \log n \cdot \log \mathcal{V})$.

Minimum Spanning Trees (MST)

Algorithm       Communication                                       Time
$MST_{ghs}$     $O(\mathcal{E} + \mathcal{V} \log n)$                         $O(\mathcal{E} + \mathcal{V} \log n)$
$MST_{centr}$   $O(n \mathcal{V})$                                       $O(n \cdot Diam(MST))$
$MST_{fast}$    $O(\mathcal{E} \cdot \log n \log \mathcal{V})$                $O(\mathcal{D} \cdot n \log n \log \mathcal{V})$
$MST_{hybrid}$  $O(\min\{\mathcal{E}, n \cdot \mathcal{V} \log n\})$          $O(\min\{\mathcal{E}, n \cdot \mathcal{V} \cdot \log n\})$
Lower bound     $\Omega(\min\{\mathcal{E}, n \cdot \mathcal{V}\})$

Figure 3: Our MST algorithms.


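Shortest Path Trees (SPT)

Algorithm       Communication                                                                 Time
$SPT_{centr}$   $O(w(SPT) \cdot n) = O(n^2 \cdot \mathcal{V})$                                      $O(\mathcal{D} \cdot n)$
$SPT_{recur}$   $O(\mathcal{E}^{1+\epsilon})$                                                      $O(\mathcal{D}^{1+\epsilon})$
$SPT_{synch}$   $O(\mathcal{E} + \mathcal{D} \cdot kn \cdot \log n)$                                    $O(\mathcal{D} \cdot \log_k n \cdot \log n)$
$SPT_{hybrid}$  $O(\min\{\mathcal{E} + \mathcal{D} \cdot kn \cdot \log n, \mathcal{E}^{1+\epsilon}\})$       $O(\mathcal{D}^{1+\epsilon})$
Lower bound     $\Omega(\min\{\mathcal{E}, n \cdot \mathcal{V}\})$                                      $\Omega(\mathcal{D})$

Figure 4: Our SPT algorithms.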


1.4.7  Constructing  shortest  path  trees 

Problem: The shortest path tree (SPT) of the graph $G$ with respect to a source vertex
$s \in V$ is a tree defined by the collection of shortest paths from $s$ to all other vertices in $G$.

Results: We develop a number of SPT algorithms:

1. An algorithm with communication complexity $O(\mathcal{E}^{1+\epsilon})$ and time complexity $O(\mathcal{D}^{1+\epsilon})$.
This is analogous to the result of [Awe89], which achieves the same result for the unweighted
case.

2. An algorithm with communication complexity $O(\mathcal{E} + \mathcal{D} \cdot kn \cdot \log n)$ and time complexity
$O(\mathcal{D} \cdot \log_k n \cdot \log n)$.


1.5  Structure  of  this  paper 

The paper proceeds as follows. Section 2 gives tight upper and lower bounds on the
computation of global functions. Section 3 contains clock synchronization algorithms. In Section 4,
we give upper and lower bounds for network synchronizers. Section 5 deals with controller
algorithms. Section 6 discusses basic algorithmic techniques for network problems such as
broadcast, depth first search and construction of minimum spanning trees and shortest path
trees. Section 7 discusses the problems of broadcast and constructing connected components
and spanning trees. Finally, Sections 8 and 9 describe efficient algorithms for constructing
minimum spanning trees and shortest path trees, respectively.


2  Optimal  computation  of  global  functions 

2.1  The  lower  bound 

Theorem 2.1 The computation of global symmetric compact functions requires $\Omega(\mathcal{V})$
communication and $\Omega(\mathcal{D})$ time.

Proof: Suppose that the value of the function has been computed at the vertex $v$. Since
the value of a global function depends on the value of all of its arguments, there must be
some information flow from each of the vertices to $v$. Thus the subgraph $G' = (V, E')$, defined
by the set of edges $E'$ traversed by messages of the protocol, must contain a path from $v$ to
any other vertex in $V$, i.e., it must contain some spanning tree of $G$.

Observe that the distance $dist(u, v, G')$ from $v$ to any other vertex $u \in V$ is a lower
bound on the time complexity of the protocol. Picking a pair of vertices $u, v$ realizing $\mathcal{D}$ (i.e.,
maximizing the distance $dist(u, v, G)$) and noting that $dist(u, v, G') \ge dist(u, v, G) = \mathcal{D}$, we
get that $\mathcal{D}$ is a lower bound on the time complexity of the protocol.

Furthermore, the total weight of the edges of $G'$, $w(G')$, is a lower bound on the
communication complexity of the computation. Now, since $G'$ contains a spanning tree of $G$, its
total weight satisfies $w(G') \ge \mathcal{V}$. Thus $\mathcal{V}$ lower bounds the communication complexity of
the protocol. ∎

2.2  The  upper  bound 

It is easy to see that given a spanning tree $T$ for the network, a global function can be
computed with communication complexity $w(T)$ and time complexity $Diam(T)$. Clearly,
any shortest path tree $T_S$ has small depth, namely $Diam(T_S) = O(\mathcal{D})$, but its weight may
be as big as $w(T_S) = \Omega(n \cdot \mathcal{V})$ [BKJ83]. Analogously, any minimum spanning tree $T_M$ has
small weight, namely $w(T_M) = \mathcal{V}$, but its depth may be as high as $Diam(T_M) = \Omega(n \cdot \mathcal{D})$
[BKJ83]. Thus in [BKJ83, Jaf85] it is advocated to approach such problems by attempting
to construct a tree approximating both a shortest-path tree and a minimum-weight spanning
tree.

Recall that a spanning tree is a shallow-light tree (SLT) if its diameter is $O(\mathcal{D})$ and its
weight is $O(\mathcal{V})$. Such trees minimize simultaneously both weight and depth; existence of such
a tree would imply that in any graph, one can compute global functions with communication
complexity $O(\mathcal{V})$ and $O(\mathcal{D})$ time. However, it is not clear that such trees exist. In the next
subsection we establish

Theorem 2.2 Every graph has a shallow-light spanning tree.

Corollary 2.3 The computation of global symmetric compact functions can be performed with
communication complexity $O(\mathcal{V})$ and $O(\mathcal{D})$ time. ∎

The  shallow-light  tree  algorithm 

We  next  provide  an  algorithm  (hereafter  referred  to  as  the  SLT  algorithm)  for  constructing 
an  SLT  for  an  arbitrary  graph,  thus  proving  Theorem  2.2. 

1. Construct an MST $T_M$ and an SPT $T_S$ for $G$, rooted at an arbitrary vertex $v_0$.

2. Traverse $T_M$ in a depth-first search (DFS) fashion, starting from $v_0$. We think of
the DFS as carried out by a “token”, representing the algorithm's center of activity.
Observe that in the tour taken by this token through the tree according to the
specifications of the DFS, each tree edge is traversed exactly twice. Define the “mileage” of
the DFS token at a given time to be the number of steps (forward and backward tree
edge traversals) up to this time. Denote by $v(i)$ ($0 \le i \le 2(n - 1)$) the location of the
DFS token at the time its mileage is exactly $i$. For example, $v(0) = v(2n - 2) = v_0$,
where $v_0$ is the source of the DFS.

3. Construct the “line-version” $L$ of $T_M$, which is a (weighted) path graph containing
vertices $0, 1, \ldots, 2n - 2$. A vertex $i$ on the path corresponds to $v(i)$. We assign each
edge $e = (i, i + 1)$ on the line $L$ the weight of the corresponding edge $(v(i), v(i + 1))$ in
the graph $G$. Observe that neighboring vertices on the line $L$ correspond to neighbors
in $T_M$. Observe that the total weight of the line is at most twice the total weight of
the MST $T_M$, i.e., $w(L) \le 2\mathcal{V}$.

4. Fix a parameter $q > 0$. Construct “break points” $B_i$ on $L$ by scanning it from left to
right according to the following rules.

(a) Break-point $B_1$ is vertex 0 on the line $L$.

(b) Break-point $B_{i+1}$ is the first point to the right of $B_i$ such that
$dist(B_i, B_{i+1}, L) > q \cdot dist(v(B_i), v(B_{i+1}), T_S)$,
meaning that the distance from $B_i$ to $B_{i+1}$ in $T_M$ exceeds that in $T_S$ by a factor
of at least $q$.

5. Create a subgraph $G'$ of $G$ by taking $T_M$ and adding $Path(v(B_i), v(B_{i+1}), T_S)$ for all
break-points $B_i$, $i \ge 1$.

6. Construct a shortest path tree $T$ rooted at $v_0$ in the resulting graph $G'$.

7. Output $T$.

    Construct a minimum spanning tree T_M for G.
    Construct a shortest path tree T_S for G.
    Construct L based on T_M as described above.
    Assign each edge e of L the same weight as w(e) in G.
    E' <- T_M
    X <- 0; Y <- 0
    repeat
        repeat Y <- Y + 1
        until dist(X, Y, L) > q * dist(X, Y, T_S)
        E' <- E' U Path(X, Y, T_S)
        X <- Y
    until Y = n
    Construct a shortest path tree T in G' = (V, E')
    output T

Figure 5: The SLT algorithm

The algorithm is presented formally in Figure 5. In the algorithm, $T$ denotes the set of
edges selected to the shallow-light tree and $X$, $Y$ are pointers to nodes on the line $L$. An
example of a possible execution of the algorithm is given in Figure 6.
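
The seven steps can also be followed in a short centralized simulation. The Python sketch below is an illustration under several assumptions that are not part of the paper: the graph is an edge list of `(u, v, weight)` triples with mutually comparable vertex labels, the MST is built with Prim's algorithm and the SPTs with Dijkstra's algorithm, all function names are hypothetical, and the scan simply stops when it reaches the end of the line. It is not the distributed construction discussed later.

```python
import heapq

def build_adj(edges):
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj

def spt(adj, root):
    """Dijkstra: (dist, parent) of a shortest path tree rooted at root."""
    dist, parent = {root: 0}, {root: None}
    heap = [(0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale queue entry
        for v, w in adj[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v], parent[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    return dist, parent

def mst(adj, root):
    """Prim: adjacency map of a minimum spanning tree."""
    tree, seen = {v: {} for v in adj}, {root}
    heap = [(w, root, v) for v, w in adj[root].items()]
    heapq.heapify(heap)
    while heap:
        w, u, v = heapq.heappop(heap)
        if v in seen:
            continue
        seen.add(v)
        tree[u][v] = tree[v][u] = w
        for x, wx in adj[v].items():
            if x not in seen:
                heapq.heappush(heap, (wx, v, x))
    return tree

def dfs_tour(tree, root):
    """Locations v(0), ..., v(2n-2) of the DFS token (recursive, small inputs)."""
    tour = [root]
    def visit(u, parent):
        for c in tree[u]:
            if c != parent:
                tour.append(c)
                visit(c, u)
                tour.append(u)
    visit(root, None)
    return tour

def tree_path(parent, x, y):
    """Edges of the tree path between x and y (parent pointers toward the root)."""
    ancestors, v = set(), x
    while v is not None:
        ancestors.add(v)
        v = parent[v]
    edges, v = [], y
    while v not in ancestors:   # climb from y to the common ancestor
        edges.append((v, parent[v]))
        v = parent[v]
    meet, v = v, x
    while v != meet:            # climb from x to the common ancestor
        edges.append((v, parent[v]))
        v = parent[v]
    return edges

def shallow_light_tree(edges, root, q):
    """Return (dist, parent) of the final shortest path tree T in G'."""
    adj = build_adj(edges)
    t_m = mst(adj, root)
    _, sp_parent = spt(adj, root)
    tour = dfs_tour(t_m, root)
    prefix = [0]                                    # prefix[i] = dist(0, i, L)
    for a, b in zip(tour, tour[1:]):
        prefix.append(prefix[-1] + adj[a][b])
    sub = {v: dict(nb) for v, nb in t_m.items()}    # E' starts as the MST
    x = 0                                           # current break point
    for y in range(1, len(tour)):
        p = tree_path(sp_parent, tour[x], tour[y])
        if prefix[y] - prefix[x] > q * sum(adj[a][b] for a, b in p):
            for a, b in p:                          # add Path(X, Y, T_S) to E'
                sub[a][b] = sub[b][a] = adj[a][b]
            x = y
    return spt(sub, root)

# Hypothetical example: a 5-cycle whose closing edge (0, 4) is slightly heavier.
RING = [(0, 1, 2), (1, 2, 2), (2, 3, 2), (3, 4, 2), (0, 4, 3)]
```

On this ring the MST is the path 0-1-2-3-4 (weight 8, depth 8 from vertex 0). With `q=1` the scan adds the shortest path to vertex 3, giving a tree of weight 9 and depth 5; with `q=3` no useful shortcut is added and the result is the MST itself. This reflects the role of `q` as the knob trading weight against depth.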


2.3  Analysis 

Lemma 2.4 The tree $T$ constructed by the algorithm satisfies $w(T) \le (1 + \frac{2}{q})\mathcal{V}$.


Figure 6: An example run of the SLT algorithm


Proof: The tree $T$ is a subgraph of $G'$, created by adding the paths $Path(v(B_i), v(B_{i+1}), T_S)$,
for $i \ge 1$, to $T_M$. Therefore

$$w(T) \le w(G') \le w(T_M) + \sum_{i \ge 1} w(Path(v(B_i), v(B_{i+1}), T_S)).$$

But by choice of the breakpoints $B_i$,

$$w(Path(v(B_i), v(B_{i+1}), T_S)) = dist(v(B_i), v(B_{i+1}), T_S) < \frac{1}{q} \cdot dist(B_i, B_{i+1}, L),$$

hence

$$\sum_{i \ge 1} w(Path(v(B_i), v(B_{i+1}), T_S)) < \frac{1}{q} \sum_{i \ge 1} dist(B_i, B_{i+1}, L) \le \frac{1}{q} \cdot w(L) \le \frac{1}{q} \cdot 2\mathcal{V},$$

and thus $w(T) \le (1 + \frac{2}{q})\mathcal{V}$. ∎

Lemma 2.5 The tree $T$ constructed by the algorithm satisfies $Diam(T) \le (q + 1)\mathcal{D}$.


Proof: Consider an arbitrary vertex $x \in T$. We need to show that its depth in $T$ is at most
$(q + 1)\mathcal{D}$. For that it suffices to bound $dist(v_0, x, G')$. Let $j$ denote the point corresponding
to $x$ on the line $L$. Suppose that $B_\ell \le j < B_{\ell+1}$, i.e., $j$ occurs on the line $L$ between $B_\ell$ and
$B_{\ell+1}$ for some $\ell$. Since $Path(v(B_i), v(B_{i+1}), T_S)$ is included in $G'$ for every $1 \le i \le \ell - 1$, it
follows that the entire path $Path(v(B_1), v(B_\ell), T_S)$ connecting the nodes corresponding to
$B_1$ and $B_\ell$ in $T_S$ is included in $G'$, and hence

$$dist(v_0, v(B_\ell), G') \le \mathcal{D}.$$

If $j = B_\ell$ then we are done. Otherwise, by the choice of $B_{\ell+1}$,

$$dist(B_\ell, j, L) \le q \cdot dist(v(B_\ell), x, T_S).$$

Put together, we get that $dist(v_0, x, G') \le (q + 1)\mathcal{D}$. ∎

Corollary  2.6  The  tree  T  constructed  by  the  algorithm  is  an  SLT. 


2.4 Distributed construction of shallow-light trees

Theorem 2.7 There is a distributed algorithm for constructing an SLT requiring O(V · n^2)
communication and O(V · n^2) time.

Proof: By Subsection 6.3, the MST T_M can be constructed using Algorithm MST_centr with
O(n · V) communication and O(n^2 · D) time.

Now, the rest involves stretching the MST into a line. By Fact 6.3, the total weight
of the MST is at most n − 1 times the diameter, i.e., V ≤ (n − 1)D. Thus the time and
communication of the main body of the SLT algorithm are both O(n^2 · V).

Finally, we need to compute one more SPT in order to get the final tree T out of the
subgraph G'. This is done using Algorithm SPT_centr of Subsection 6.4, and costs us an additional
O(V · n) time and O(n^2 · V) communication.

Overall, the algorithm invests O(V · n^2) time and O(V · n^2) communication. ∎


3  Clock  synchronization 

In this section we describe three methods of clock synchronization, called synchronizers α*,
β* and γ*. These are modifications of synchronizers α, β and γ of [Awe85a].

3.1  Clock  synchronizer  a* 

As pointed out in [ER90], the most natural approach to clock synchronization is to use the
following synchronization mechanism, called synchronizer α*.

Pulse generation: whenever a node generates pulse p, it sends messages to all neighbors,
and when it receives messages of pulse p from all neighbors, it generates pulse p + 1.
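
The pulse rule above can be illustrated with a small timing sketch. It is not part of the original text: it assumes that every message takes exactly the weight of its edge to arrive and that all nodes generate pulse 0 at time 0, and the function name `alpha_star_pulse_times` is hypothetical.

```python
def alpha_star_pulse_times(weights, num_pulses):
    """Return {node: [time of pulse 0, time of pulse 1, ...]} under the rule above.

    weights: dict mapping an undirected edge (u, v) to its delay.
    Assumption (not from the text): delays equal the edge weights exactly.
    """
    adj = {}
    for (u, v), w in weights.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w

    times = {v: [0] for v in adj}  # every node generates pulse 0 at time 0
    for p in range(num_pulses):
        for v in adj:
            # Pulse p+1 needs the pulse-p message of every neighbor.
            last_arrival = max(times[u][p] + w for u, w in adj[v].items())
            times[v].append(max(times[v][p], last_arrival))
    return times
```

On a triangle with weights 1, 1 and 10, the two endpoints of the heavy edge need 10 time units per pulse, which is the dependence on the highest edge weight discussed next.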

This method clearly requires time proportional to the highest edge weight, namely O(W).
Our goal is to approach the lower bound, which is Ω(d) (recall that d is the largest distance
between neighbors).

The naive way to improve the delay is to construct a shortest path Path(u, v) for all
(u, v) ∈ E and to communicate with each neighbor over such a path. The problem with this
method is that a particular edge may belong to many paths (up to E), and thus the resulting
congestion will slow down the communication time by the corresponding factor (up to E).


3.2 Clock synchronizer β*

In order to minimize congestion, we may try the following method, called synchronizer β*.

Preprocessing:  We  construct  a  spanning  tree  T  of  the  network,  and  select  a  “leader”  to 
be  the  root  of  this  tree. 

Pulse generation: Information about the completion of the current pulse is gathered on
the tree by means of a communication pattern referred to as convergecast, which is started
at  the  leaves  of  the  tree  and  terminates  at  the  root.  Namely,  whenever  a  node  learns  that  it 
is  done  with  this  pulse  and  all  its  descendants  in  the  tree  are  done  with  it  as  well,  it  reports 
this  fact  to  its  parent.  Thus  within  finite  time  after  the  execution  of  the  pulse,  the  leader 
eventually  learns  that  all  the  nodes  in  the  network  are  done.  At  that  time  it  broadcasts  a 
message  along  the  tree,  notifying  all  the  nodes  that  they  may  generate  a  new  pulse. 

The time complexity of Synchronizer β* is Ω(D), because the entire convergecast and
broadcast process is performed along a spanning tree, whose depth is at least the diameter
of the network.
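
As a rough illustration of this bound, the sketch below computes the time for one convergecast followed by one broadcast on a rooted weighted tree. It assumes each message takes exactly the weight of the edge it crosses; the names `tree_depth` and `beta_star_pulse_delay` are hypothetical and do not come from the paper.

```python
def tree_depth(children, weight, root):
    """Weighted depth of a rooted tree.

    children: dict node -> list of child nodes.
    weight:   dict (parent, child) -> edge weight.
    """
    depth = 0
    stack = [(root, 0)]
    while stack:
        node, dist = stack.pop()
        depth = max(depth, dist)
        for child in children.get(node, []):
            stack.append((child, dist + weight[(node, child)]))
    return depth


def beta_star_pulse_delay(children, weight, root):
    # One convergecast from the deepest leaf up to the root,
    # followed by one broadcast from the root back down.
    return 2 * tree_depth(children, weight, root)
```

The delay is governed by the depth of the chosen tree, so a poorly chosen spanning tree makes every pulse expensive.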


3.3 Clock synchronizer γ*

Our final synchronizer, called synchronizer γ*, combines synchronizer γ of [Awe85a] with
the network partitions of [AP91, AP90b].

Definition  3.1  Given  an  n-vertex  weighted  graph  G(V,  E,w),  a  tree  edge-cover  for  G  is  a 
collection  M  of  trees,  such  that 

1. every edge of G is shared by at most O(log n) trees of M,

2. the depth of each tree in M is at most O(log n · d), and

3.  for  each  edge,  there  exists  at  least  one  tree  containing  both  endpoints. 

Lemma  3.2  For  every  n-vertex  weighted  graph  G(V,  E,  w),  it  is  possible  to  construct  a  tree 
edge-cover. 

Proof: The desired collection of trees can be constructed as follows. Apply Thm. 1.1 to
the graph G, with the initial cover taken to be S = {Path(u, v, G) | (u, v) ∈ E}, and the
parameter k = log n. The tree edge-cover is now constructed by selecting a shortest path
spanning tree for each of the clusters of T. The desired properties are guaranteed by the
theorem (noting, in particular, that Rad(S) ≤ d and |S| = n, and therefore the output cover
T satisfies Rad(T) = O(d log n) and Δ(T) = O(log n)). ∎

Preprocessing: Construct a tree edge-cover for G. Inside each tree, a leader is chosen to
coordinate the operations of the tree. We call two trees neighboring if they share a node.

Pulse generation: The process is performed in two phases. In the first phase, Synchronizer
β is applied separately in each tree. Whenever the leader of a tree learns that its tree
is done, it reports this fact to all the nodes in the tree, which relay it to the leaders of all the
neighboring trees. Now, the nodes of the tree enter the second phase, in which they wait
until all the neighboring trees are known to be done and then generate the next pulse (as if
Synchronizer α* is applied among trees). More details will be given in the full paper.

Complexity: The “congestion” caused by the fact that messages of different trees cross
the same edge adds at most an O(log n) multiplicative factor to the time overhead. Since the
height of each tree is O(d log n), it follows that the time to simulate one pulse is O(d · log^2 n).


4  Network  synchronizers 

4.1  Construction  outline 

The synchronizers discussed in this section operate by generating sequences of “clock-pulses”
at  each  vertex  of  the  network,  satisfying  the  following  property:  pulse  p  is  generated  at  a 
vertex  only  after  it  receives  all  the  messages  of  the  synchronous  algorithm  that  arrive  at  that 
vertex  prior  to  pulse  p.  This  property  ensures  that  the  network  behaves  as  a  synchronous 
one  from  the  point  of  view  of  the  particular  synchronous  algorithm. 

The problem arising with synchronizer design is that a vertex cannot know which messages
were sent to it by its neighbors and there are no bounds on edge delays. Thus, the
above  property  cannot  be  achieved  simply  by  waiting  “enough  time”  before  generating  the 
next  pulse,  as  may  be  possible  in  a  network  with  bounded  delays.  However,  it  may  be 
achieved  if  additional  messages  are  sent  for  the  purpose  of  synchronization. 

In the unweighted synchronizers of [Awe85a], incoming links are “cleaned” from transient
messages in between any two consecutive pulses, similar to the clock synchronizers in Section
3. In our (weighted) case, this would be very inefficient since cleaning the links requires
time proportional to the maximal link weight W, which would therefore dictate the
multiplicative overhead of the synchronization. The idea for overcoming this problem is that
links of high weight should be cleaned less frequently, which makes it possible to amortize the cost of
cleaning them over longer time intervals.

Our synchronizer, denoted γ_w, is also a modification of synchronizer γ of [Awe85a].
Synchronizer γ is a combination of the two simple synchronizers α and β, which are, in
fact, generalizations of the techniques of [Gal82]. Synchronizer α is efficient in terms of
time but wasteful in communication, while synchronizer β is efficient in communication but
wasteful in time. However, we manage to combine these synchronizers in such a way that the
resulting synchronizer is efficient both in time and communication. Before describing these
synchronizers, we introduce the concept of safety for the weighted case.

Definition 4.1 A message sent from a vertex v to one of its neighbors u over the edge e = (v, u)
at pulse q is said to affect a later pulse p if q + w(e) ≤ p. A vertex v is said to be safe with
respect to pulse p if each affecting message of the synchronous algorithm sent by v at earlier
pulses has already arrived at its destination.

Each  vertex  eventually  becomes  safe  w.r.t.  a  pulse  p  some  time  after  sending  all  of  its 
messages  from  earlier  pulses.  If  we  require  that  an  acknowledgment  is  sent  back  whenever 
a  message  of  the  algorithm  is  received  from  a  neighbor,  then  each  vertex  may  detect  that  it 
is  safe  w.r.t.  pulse  p  whenever  all  its  affecting  messages  have  been  acknowledged.  Observe 
that  the  acknowledgments  do  not  increase  the  asymptotic  communication  complexity,  and 
each  vertex  learns  that  it  is  safe  w.r.t.  pulse  p  within  constant  time  after  it  executed  its 
pulse  p  —  1. 

A  new  pulse  p  may  be  generated  at  a  vertex  whenever  it  is  guaranteed  that  no  affecting 
message  sent  at  the  previous  pulses  of  the  synchronous  algorithm  may  arrive  at  that  vertex 
in  the  future.  Certainly,  this  is  the  case  whenever  all  the  neighbors  of  that  vertex  are  known 
to  be  safe  w.r.t.  pulse  p.  It  remains  to  find  a  way  to  deliver  this  information  to  each  vertex 
with  small  communication  and  time  costs. 

We  need  to  state  a  number  of  definitions  first. 

Definition 4.2 Given a synchronous protocol π running on a synchronous weighted network
G(V, E, w), we say that π is in synch with G if π transmits a message on edge e only at times
that are divisible by w(e).

Definition 4.3 A weighted network G(V, E, w) is said to be normalized if all weights w(e) are
powers of 2.

Informally, our solution proceeds according to the following plan.


1.  Design  a  synchronizer  for  normalized  networks  and  protocols  that  are  in  synch  with 
the  networks  on  which  they  are  run. 

2. Show that one can transform an arbitrary synchronous protocol π and synchronous
network  G,  so  that  the  above  assumptions  are  satisfied,  without  significantly  increasing 
the  complexities. 

These  two  steps  are  described  in  the  following  two  subsections. 


4.2 Synchronizer γ_w

We assume now that the weights of all network edges are powers of 2, and messages are sent
on an edge of weight 2^i only at times divisible by 2^i.

Let δ = log W. We define a collection of sub-networks {G_i(V, E_i) | 0 ≤ i ≤ δ}, by defining
E_i to be the set of edges whose weights are divisible by 2^i. (Note that an edge e with weight
w(e) = 2^j occurs in all graphs G_i for j ≥ i.)

The idea is that pulses divisible by 2^i are handled by a so-called synchronizer γ_i, which
is exactly synchronizer γ of [Awe85a], applied to the graph G_i. The synchronizer γ_i treats
pulse p · 2^i as “super-pulse” p. It guarantees that super-pulse p is executed only after all
messages sent along edges in E_i at super-pulse (p − 1) have arrived.

A vertex has to satisfy all δ synchronizers in order to proceed with a pulse. More specifically,
consider a pulse p = 2^j · (2r + 1), i.e., such that 2^j is the maximal power of 2 dividing
p. Then pulse p is postponed until super-pulse (2r + 1) · 2^(j−i) of synchronizer γ_i is executed,
for every 0 ≤ i ≤ j. For example, pulse 24 = 3 · 2^3 is completed only after the synchronizers
γ_0, γ_1, γ_2, and γ_3 are done carrying out their pulses 24, 12, 6 and 3, respectively.
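
The bookkeeping described in this subsection can be sketched in a few lines. This is only an illustration under the normalization assumption (all weights are powers of 2); the helper names `subnetwork_edges` and `required_super_pulses` are hypothetical, and the sketch does not cap the index i at δ.

```python
def subnetwork_edges(weights, i):
    """E_i: the edges whose (power-of-2) weight is divisible by 2**i.

    weights: dict mapping an edge to its weight.
    """
    return {e for e, w in weights.items() if w % (1 << i) == 0}


def required_super_pulses(p):
    """For a pulse p >= 1, list (i, super-pulse of gamma_i) that must complete first."""
    # j is the exponent of the largest power of 2 dividing p.
    j = (p & -p).bit_length() - 1
    return [(i, p >> i) for i in range(j + 1)]
```

For p = 24 the second function yields the pairs (0, 24), (1, 12), (2, 6), (3, 3), matching the example in the text.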

Lemma 4.4 Synchronizer γ_w is correct.

Proof Sketch: We need to show that under synchronizer γ_w, a vertex v generates pulse
p only after receiving all messages it would receive by pulse p were the protocol executed
on a synchronous network. This follows from the fact that the set of messages it would get
by pulse p, i.e., the set of messages affecting this pulse, includes messages sent on edges
belonging to G_i sent at pulse p − 2^i, and the arrival of these messages is guaranteed by
synchronizer γ_i. ∎


4.3 Designing the protocol transformation


In  order  to  justify  the  assumptions  of  the  previous  subsection  we  need  to  prove  the  following 
claim. 

Lemma 4.5 Given a synchronous protocol π running on a synchronous weighted network G(V, E, w),
there exist a synchronous protocol π' and a synchronous network G'(V, E, w') with the following
properties:

1. G' is normalized.

2. The protocol π' is in synch with G'.

3. The output of π' on G' is identical to the output of π on G.

4. The time and communication complexities of a run of π' on G' are at most twice
the complexities of the corresponding run of π on G.

The lemma is proved throughout the rest of this subsection. We proceed as follows.
Consider a message M transmitted by π on the edge e = (u, v) with weight w(e) = w. For
this message we define the following quantities.

S_M: the time by which M is sent by u.

R_M: the time by which M is received by v.

P_M: the processing time of M at v, which is the first time (at v) by which the contents of
that message might actually be used, i.e., by which the vertex program at v behaves
differently depending on the contents of M or the fact that M has not been sent.

Clearly, S_M + w = R_M ≤ P_M. However, without loss of generality, we can assume
P_M = R_M for the protocol π.

Step 1: Transform π into a synchronous protocol π' slowed down by a factor of 4, i.e.,
where events that happen at time t in π now happen at time t' = 4t in π'. That is, for any
message M, the sending, receiving, and processing times S'_M, R'_M, P'_M in π' are shifted as
follows.

$$ \begin{aligned} S'_M &= 4 S_M \\ R'_M &= S'_M + w = 4 S_M + w \\ P'_M &= 4 P_M = 4 (S_M + w) \end{aligned} $$

Observe, however, that edge delays are not stretched by a factor of 4, and hence the
arrival time R'_M of M in π' clearly precedes its processing time P'_M. This is not a problem,
since the message can be kept in an edge buffer and be effectively ignored until time t' + 4w.
This implies that an artificial increase in the delay of edge e from w to any value below 4w
is not going to affect the protocol. This explains the next step.

Definition 4.6 Let power(w) denote the smallest power of 2 that is larger than or equal to w,
i.e., power(w) = 2^⌈log w⌉.

Observe that w ≤ power(w) < 2w.

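Step 2: Instead of running π' on G, run π' on the network G'(V, E, w'), where w'(e) =
power(w(e)). Writing hats for times measured on G', the sending, receiving, and processing
times of M in π' on G' are

$$ \begin{aligned} \hat{S}'_M &= S'_M \\ \hat{R}'_M &= S'_M + \mathrm{power}(w) \\ \hat{P}'_M &= P'_M \end{aligned} $$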

Next,  it  is  necessary  to  guarantee  that  messages  are  sent  on  e  only  at  times  divisible  by 
power(w). 

Definition 4.7 Define next_w(t) as the first time at or after t that is divisible by power(w).

Observe that t ≤ next_w(t) ≤ t + (power(w) − 1).

Step 3: Modify π' to obtain a new protocol π'', running on G', where for the above message
M, the sending, receiving, and processing times S''_M, R''_M, P''_M are shifted as

$$ \begin{aligned} S''_M &= \mathrm{next}_w(S'_M) \\ R''_M &= S''_M + \mathrm{power}(w) \\ P''_M &= P'_M \end{aligned} $$

The main difference between π' and π'' is that the transmission of M in π'' may be delayed
by at most power(w) − 1 < 2w. This does not cause a problem, because the message M is being
ignored until time P'_M = 4w + S'_M at the receiving end; thus P''_M ≥ R''_M = S''_M + power(w) still holds.
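
The three steps can be checked numerically with a short sketch. It is an illustration only, assuming integer times and weights; the function names `power`, `next_time` and `transformed_times` are chosen here for readability and are not taken from the paper.

```python
def power(w):
    """Smallest power of 2 that is >= w, for a positive integer w."""
    return 1 << (w - 1).bit_length()


def next_time(t, w):
    """First time at or after t that is divisible by power(w)."""
    step = power(w)
    return -(-t // step) * step  # ceiling division, then scale back


def transformed_times(send_time, w):
    """Return (S'', R'', P'') for a message sent at `send_time` over an edge of weight w."""
    s1 = 4 * send_time        # Step 1: S'_M = 4 S_M
    p1 = 4 * (send_time + w)  # Step 1: P'_M = 4 (S_M + w)
    s2 = next_time(s1, w)     # Step 3: delay sending to a multiple of power(w)
    r2 = s2 + power(w)        # Steps 2 and 3: the edge delay is rounded up to power(w)
    return s2, r2, p1         # the processing time is left unchanged
```

For every pair of small integer values tried, the message arrives no later than its processing time, which is the property the argument above relies on.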

4.4  Complexity 

Lemma 4.8 The synchronizer γ_w described above has the following complexities:

C_p(γ_w) = O(k · n · log W) = O(k · n · log n)

T_p(γ_w) = O(log_k n · log W) = O(log_k n · log n)

Proof: Synchronizer γ_i is invoked on the graph G_i once every 2^i time units. This costs us
O(2^i · n · k) in communication and O(2^i · log_k n) time. This waste is amortized over 2^i time
units, and then summed over all 0 ≤ i ≤ log W graphs G_i. ∎
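
One way to spell out the amortization in this proof is the following worked sum. It is a sketch of the reasoning rather than a statement from the original; the final step to log n in the lemma presumably relies on W being polynomial in n, which is an assumption here.

```latex
\begin{aligned}
C_p(\gamma_w) &= \sum_{i=0}^{\log W} \frac{O(2^i \cdot n \cdot k)}{2^i}
               = \sum_{i=0}^{\log W} O(n \cdot k)
               = O(k \cdot n \cdot \log W), \\
T_p(\gamma_w) &= \sum_{i=0}^{\log W} \frac{O(2^i \cdot \log_k n)}{2^i}
               = \sum_{i=0}^{\log W} O(\log_k n)
               = O(\log_k n \cdot \log W).
\end{aligned}
```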


5  Controllers 

Controllers are applicable in situations where one suspects the possibility that errors in
the input data, or processor faults, may cause a given protocol π to diverge away from
its specification. The danger is that while the protocol may have firm bounds c_π and t_π
on its communication and time complexity under normal circumstances (i.e., for correct
executions), it may waste valuable resources in an uncontrolled fashion once it diverges at
some processors in the network. The task of a controller is to transform the protocol π into
a “controlled” protocol φ whose semantics is identical to that of the original π under correct
input but whose complexities are reasonably bounded even on incorrect input.

We  consider  here  the  same  model  as  in  [AAPS87].  The  protocol  starts  at  a  certain  vertex, 
called  the  initiator ,  and  vertices  enter  the  protocol  as  a  result  of  receiving  a  message  of  the 
protocol.  This  model  is  referred  to  as  “diffusing  computation”  by  Dijkstra  and  Scholten 
[DS80].  It  is  worth  mentioning  that  it  is  easy  to  extend  the  case  of  a  single  initiator  to  the 
general  case  of  multiple  initiators. 

Suppose  that  at  the  time  a  vertex  receives  the  first  message  of  the  protocol,  it  marks 
the  edge  over  which  the  message  has  been  received.  It  is  easy  to  see  that  the  collection  of 
marked  edges  forms,  at  all  times,  a  dynamically  growing  tree  rooted  at  the  initiator  vertex. 
This  tree  is  called  the  execution  tree  of  the  protocol. 

Our  purpose  is  to  control  the  growth  of  this  tree,  namely,  to  guarantee  that  not  too 
many  messages  are  sent  during  the  protocol,  without  affecting  the  “correct”  executions  of 
the  protocol. 

Towards  this  goal,  the  Main  CONTROLLER  of  [AAPS87]  views  every  message  sent  by  the 
protocol  as  consuming  one  unit  of  some  abstract  “resource”.  The  protocol  must  authorize 
every  single  consumption  of  the  resource.  That  is,  a  vertex  that  wants  to  consume  a  resource 
unit  (i.e.  send  a  message)  must  first  send  a  special  “request”  up  on  the  execution  tree,  and 
then  wait  to  get  a  special  “grant”  message,  before  the  resource  may  actually  be  consumed. 

The  most  naive  (but  inefficient)  way  of  controlling  the  amount  of  resource  units  consumed 
in  a  growing  tree  is  to  request  the  root  to  authorize  the  consumption  of  each  resource  unit. 
That is, for each resource unit consumed, a request has to go up on the execution tree
towards the root. Once this request reaches the root, the root increases its “permit counter”
by  1.  In  case  this  counter  is  still  less  than  or  equal  to  some  “threshold”,  it  sends  back  a 
permit,  which  authorizes  the  consumption  of  the  appropriate  resource.  Otherwise,  if  the 
counter  has  exceeded  the  “threshold”,  execution  is  suspended  and  the  execution  tree  stops 
growing. 
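
The root's side of this naive scheme can be sketched as a small class. This is an illustration, not the controller of [AAPS87]: the class name `NaiveRootController` and the `units` parameter (which would let one request stand for several resource units) are assumptions introduced here.

```python
class NaiveRootController:
    """Root-side bookkeeping of the naive controller (illustrative sketch)."""

    def __init__(self, threshold):
        self.threshold = threshold  # maximum number of resource units to authorize
        self.permit_counter = 0
        self.suspended = False

    def request(self, units=1):
        """Process a request that has reached the root; return True if a permit is sent back."""
        if self.suspended:
            return False
        self.permit_counter += units
        if self.permit_counter <= self.threshold:
            return True
        # The counter exceeded the threshold: suspend the execution.
        self.suspended = True
        return False
```

With a threshold of 3, the first three unit requests are granted and every later request is refused.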

In our case, we need to set this threshold to the value of c_π, which is the complexity
of π in a correct execution. Thus the naive controller above will not interfere with correct
executions;  its  only  effect  is  to  stop  executions  that  are  obviously  incorrect. 

The  more  efficient  algorithm  of  [AAPS87]  is  similar  to  the  naive  controller  above  in  that 
the  resource  requests  are  propagated  up  to  the  root  and  the  permits  are  distributed  from 
the  root.  The  algorithm  of  [AAPS87]  takes  advantage  of  the  fact  that  a  single  message 
can  represent  a  number  of  resource  requests  or  a  number  of  permits.  The  main  idea  is  to 
aggregate  a  large  number  of  requests/permits  together,  represent  them  by  a  single  message, 
and  allow  this  message  to  travel  up  the  tree  for  a  distance  proportional  to  this  number. 
Permits  are  kept  not  only  at  the  root,  but  also  at  intermediate  vertices.  The  root  keeps 
an  approximate  permit  counter,  so  that  the  actual  number  of  resource  units  consumed  is  at 
most  twice  the  value  of  this  counter.  For  this  reason,  even  though  the  root  threshold  remains 
the  same  as  in  the  naive  controller  above,  the  guaranteed  upper  bound  on  the  number  of 
protocol messages sent will be twice as large as in the case of the naive controller, namely
2c_π. Again, the algorithm does not interfere with correct executions of the protocol.

It is shown in [AAPS87] that the communication overhead of the authorization mechanism
(namely, permit/grant messages) results in at most log^2 c_π control messages traversing a given
edge of the execution tree. It follows that the total overhead of control messages, as well as the
total complexity of the resulting protocol φ, is c_φ = O(c_π log^2 c_π).

When running this protocol on a “weighted network”, we consider a transmission of a
message on an edge e as a request to consume w(e) units of the resource. Essentially, this
is equivalent to running the same algorithm as in [AAPS87] on the “unweighted” version
G' = (V', E', w') of the network, where an edge e is substituted by a path containing w(e) edges
and w(e) − 1 “dummy vertices”, and weight w'(e) = 1 for all edges e. Since the results of
[AAPS87] do not depend on the type of resource being consumed, we deduce:

Corollary 5.1 The CONTROLLER protocol given in [AAPS87] transforms an arbitrary protocol
π into an equivalent “controlled” protocol φ whose (weighted) communication and time
complexities are c_φ = O(c_π log^2 c_π) and t_φ = O(c_π log^2 c_π), respectively.


6 Basic algorithmic techniques


In this section we briefly describe several standard network algorithms and state their
complexities in the weighted setting. These algorithms will be used as basic components in the
more involved algorithms to be presented later.

6.1 The flooding algorithm CON_flood

The goal of the flooding algorithm CON_flood (cf. [Seg83]) is to broadcast a message throughout
the network. This is done as follows. Each vertex that receives the message for the first time
forwards it further to all its neighbors. Future arrivals of the message are ignored.

Fact 6.1 Algorithm CON_flood has communication complexity O(E) and time complexity O(D).
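
A small event-driven simulation makes the two quantities concrete. It is a sketch under the assumption that each message takes exactly the weight of its edge to arrive; the function name `flood` and the adjacency-dictionary representation are illustrative choices, not from the paper.

```python
import heapq


def flood(adj, source):
    """Simulate flooding from `source`.

    adj: dict node -> dict neighbor -> edge weight (symmetric).
    Returns (total weighted communication, first-arrival time per node).
    """
    first_arrival = {}
    communication = 0
    events = [(0, source)]  # (arrival time, node)
    while events:
        t, v = heapq.heappop(events)
        if v in first_arrival:
            continue  # later arrivals of the message are ignored
        first_arrival[v] = t
        for u, w in adj[v].items():
            communication += w  # the cost of a message is the weight of its edge
            heapq.heappush(events, (t + w, u))
    return communication, first_arrival
```

Every vertex sends once over each incident edge, so the communication is twice the total edge weight, and each vertex is first reached at its weighted distance from the source.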

6.2 The depth-first search algorithm DFS

The goal of the DFS algorithm DFS (cf. [Eve79, Awe85b]) is to traverse the network in
depth-first order. In order to make use of this algorithm as a component in the algorithms
described later, it is necessary to modify it as follows. At any time, the algorithm maintains
estimates of the total cost of all the edges traversed so far. Such estimates are kept both at
the center of the activity and at the root, and are called the center estimate EST_C and the
root estimate EST_R, respectively. The estimates are updated as follows.

1. Each time an edge is traversed, its weight is added to the center estimate EST_C.

2. The root estimate EST_R is updated only whenever the center of activity is about to
traverse an edge that will cause the center estimate EST_C to double compared to the
current value of EST_R. The update is done via a message from the center of activity
to the root, which sets EST_R to be the new value of EST_C.

Thus, the center estimate is the total sum of edge weights for all edges traversed so far,
while the root estimate is a lower bound on the total edge weight for all the edges traversed
so far, plus the next edge to be traversed. Moreover, this lower bound is within a factor of
two of the real value.

Observe  that  going  to  the  root  whenever  the  estimate  is  doubled  can  at  most  double 
the  communication  complexity,  as  this  complexity  can  be  viewed  as  the  sum  of  a  geometric 
progression.  This  establishes  the  following  fact. 
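
The update rule for the two estimates can be sketched as follows. The code takes the sequence of traversed edge weights as given and reads "double" as "at least twice the current root estimate"; that reading, and the name `track_estimates`, are assumptions made for the illustration.

```python
def track_estimates(edge_weights):
    """Replay the estimate updates for a sequence of positive edge weights.

    Returns (center estimate, root estimate, number of root updates).
    """
    est_c = 0  # maintained at the center of activity
    est_r = 0  # maintained at the root
    root_updates = 0
    for w in edge_weights:
        # Rule 2: before traversing, report to the root if the center
        # estimate is about to reach twice the root estimate.
        if est_c + w >= 2 * est_r:
            est_r = est_c + w
            root_updates += 1
        # Rule 1: traversing the edge adds its weight to the center estimate.
        est_c += w
    return est_c, est_r, root_updates
```

After each step the root estimate is at most the center estimate and more than half of it, so the values reported to the root grow roughly geometrically.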

Fact 6.2 Algorithm DFS has communication complexity O(E) and time complexity O(E).

6.3 The full-information minimum spanning tree algorithm MST_centr

The full-information MST algorithm MST_centr is similar in structure to Prim’s MST algorithm
(cf. [Eve79]). This algorithm proceeds in stages, each selecting the minimum-weight edge
connecting a vertex in the tree with a vertex outside the tree, and adding this edge to the
tree. The algorithm terminates when the tree spans all the vertices. The resulting tree is an
MST of the network. It remains to show how to implement each stage.
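
The stage structure just described is that of Prim's algorithm, which can be sketched centrally as below. This is a sequential illustration only, not the distributed MST_centr implementation; it assumes a connected graph, and the name `prim_mst` is hypothetical.

```python
def prim_mst(adj, root):
    """Grow a minimum spanning tree from `root`, one edge per stage.

    adj: dict node -> dict neighbor -> edge weight (symmetric, connected graph).
    Returns the list of tree edges (u, v, w) in the order they were added.
    """
    in_tree = {root}
    tree_edges = []
    while len(in_tree) < len(adj):
        # Select the minimum-weight edge leaving the current tree.
        w, u, v = min(
            (w, u, v)
            for u in in_tree
            for v, w in adj[u].items()
            if v not in in_tree
        )
        in_tree.add(v)
        tree_edges.append((u, v, w))
    return tree_edges
```

Each loop iteration corresponds to one stage, so a graph with n vertices needs n − 1 stages.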

Throughout the execution, the invariant maintained by the algorithm is that each vertex
in the tree knows the structure of the whole tree. To preserve this invariant, whenever a new
vertex is added to the tree, its name is broadcast on the tree, so it reaches all other vertices.
This requires only one message on each tree edge for each new vertex added.

As  in  algorithm  DFS,  we  maintain  at  the  root  an  estimate  on  the  total  weight  of  all  tree 
edges,  called  the  root  estimate.  Observe  that  in  this  algorithm,  the  root  estimate  is  precise, 
as  the  root  knows  the  structure  of  the  whole  tree. 

In  order  to  analyze  the  algorithm  we  need  the  following  fact,  which  gives  an  upper  bound 
on  the  diameter  of  an  MST. 

Fact 6.3 For any minimum spanning tree T of G, Diam(T) ≤ V ≤ (n − 1)D.

Proof: The first inequality is trivial. For the second, consider an edge e = (v_1, v_2) ∈ T.
By definition of the MST, this edge induces a partition of V into two subsets of vertices,
V = V_1 ∪ V_2, such that v_1 ∈ V_1, v_2 ∈ V_2 and e is the minimum weight edge connecting a
vertex in V_1 with a vertex in V_2. It follows that e is a shortest path from v_1 to v_2, since any
other path from v_1 to v_2 contains at least one edge with one endpoint in V_1 and the other
endpoint in V_2. Thus w(e) ≤ Diam(G) = D, and hence

$$ V = w(T) = \sum_{e \in T} w(e) \le (n - 1) D. $$

∎

Each phase of the algorithm MST_centr constructing a tree T requires O(V) communication
and O(Diam(T)) time. There are exactly n − 1 phases. Consequently we have

Corollary 6.4 Algorithm MST_centr has communication complexity O(n · V) and time complexity
O(min{nV, n^2 D}).


6.4 The full-information shortest path tree algorithm SPT_centr

The full-information SPT algorithm SPT_centr is in fact a distributed implementation of
Dijkstra’s algorithm (cf. [Eve79, Gal82]), computing a shortest path tree rooted at a source s.
This algorithm is very similar to the MST algorithm MST_centr described above. The algorithm
proceeds in phases, each adding one more vertex to the tree. The tree vertices know the
structure of the whole tree.

The  vertices  outside  the  tree  are  labeled.  The  label  of  a  vertex  x  is  the  minimum,  over  all 
its  neighbors  y  in  the  tree,  of  dist(s,y)  +  w(y,x).  In  each  phase,  the  vertex  with  minimum 
label  is  added  to  the  tree.  Once  this  vertex  is  chosen,  its  name  is  broadcast  over  the  whole 
tree. 
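
The labeling rule can be sketched centrally as follows. As with the MST sketch, this is a sequential illustration of Dijkstra's algorithm rather than the distributed SPT_centr itself; it assumes a connected graph with nonnegative weights, and the name `shortest_path_tree` is hypothetical.

```python
def shortest_path_tree(adj, s):
    """Build a shortest path tree rooted at s, one vertex per phase.

    adj: dict node -> dict neighbor -> edge weight (symmetric, connected graph).
    Returns (dist, parent): distances from s and the tree parent of each vertex.
    """
    dist = {s: 0}
    parent = {}
    while len(dist) < len(adj):
        # Label of an outside vertex x: min over tree neighbors y of dist(s, y) + w(y, x).
        label, x, y = min(
            (dist[y] + w, x, y)
            for y in dist
            for x, w in adj[y].items()
            if x not in dist
        )
        dist[x] = label  # the vertex with minimum label joins the tree
        parent[x] = y
    return dist, parent
```

On a triangle with weights 2, 3 and 10, the heavy edge is bypassed and the far endpoint is attached through the intermediate vertex at distance 5.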

Again,  we  need  a  basic  fact  giving  an  upper  bound  on  the  total  weight  of  a  shortest  path 
tree  in  order  to  analyze  the  algorithm. 

Fact 6.5 For any vertex s and any shortest path tree T for s in G, w(T) ≤ (n − 1)V.

Proof: Consider an edge e = (u, v) ∈ T. Let T' be an MST of G, and let P denote
Path(u, v, T'), the path connecting u with v in T'. By the definition of an SPT,

$$ w(e) \le w(P) \le w(T') = V. $$

Thus,

$$ w(T) = \sum_{e \in T} w(e) \le (n - 1) V. $$

∎


Each phase of the algorithm SPT_centr constructing a tree T requires O(w(T)) communication
and O(V) time. There are exactly n − 1 phases. Consequently we have


Corollary 6.6 Algorithm SPT_centr has time complexity O(nV) and communication complexity
O(n^2 V).


7 Connected components and spanning tree construction

In  this  section,  we  prove  matching  upper  and  lower  bounds  on  the  communication  complexity 
of  performing  the  tasks  of  finding  connected  components  and  constructing  a  spanning  tree. 

Figure 7: The graph G_9


7.1  Lower  bounds 


Let us first point out that an Ω(E) lower bound on communication is given in [AGPV89] for
the case where all edge weights are unity. In the rest of this subsection, we prove an Ω(n · V)
lower bound on the communication complexity.

Consider the family of graphs G_n = (V, E, w) defined as follows. V = {1, ..., n}. The
set of edges is composed of two subsets, E = E_p ∪ E_b, where the first subset creates a
path, E_p = {(i, i + 1) | 1 ≤ i ≤ n − 1}, and the second subset consists of bypassing edges,
E_b = {(i, n + 1 − i) | 1 ≤ i < n/2}. The weights are defined as

$$ w(e) = \begin{cases} X, & e \in E_p, \\ X^4, & e \in E_b, \end{cases} $$

where X is some large value, say X > n. Figure 7 depicts the graph G_n for n = 9.

Note  that  the  MST  for  G  is  the  subgraph  (V,  Ep)  based  on  the  path  alone,  so  V  =  nX. 
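As a sanity check on this construction, the following sketch builds the path and bypassing edges and computes the MST weight with a standard Kruskal procedure. The helper names `build_gn` and `mst_weight` and the `(weight, u, v)` edge format are assumptions for illustration; the bypassing edges are generated for i up to floor(n/2), which is one possible reading of the index range.

```python
def build_gn(n, X):
    """Return (path_edges, bypass_edges) of G_n as (weight, u, v), vertices 1..n."""
    path = [(X, i, i + 1) for i in range(1, n)]
    bypass = [(X**4, i, n + 1 - i) for i in range(1, n // 2 + 1)]
    return path, bypass

def mst_weight(n, edges):
    """Kruskal's algorithm with a simple union-find; vertices are 1..n."""
    parent = list(range(n + 1))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    total = 0
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:                        # edge joins two components
            parent[ru] = rv
            total += w
    return total
```

For n = 9 and X = 10 the path contributes 8 edges of weight 10, so the computed MST weight is 80, and no bypassing edge is selected.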

We make some assumptions similar to those of [AGPV89] regarding the model. In particular, we assume that the only operation one can do with id's is comparison; this can be extended also to general operations in case the id's are allowed to be sufficiently large.

In addition, let us formulate the following concepts. Note that every vertex in our graphs has two or three neighbors. Let us assume that each vertex maintains its own id in register $R_I$ and the id's of its neighbors in input registers $R_1$, $R_2$ and $R_3$ (if needed). The only thing that can be done with these id's is comparing them to each other and to other id's. The vertex does not distinguish between the registers. In particular, it cannot tell which register contains the id of its neighbor along the bypassing edge. However, in order to enable us to speak about this particular register, let us refer to it also as the bypassing register, $R_B$.

Messages may include vertex id's, as well as information about the relationships between id's, e.g., “the contents of register $R_1$ at vertex id $\phi$ is the id $\phi'$.” A vertex i may know any other vertex j only by its id, $\phi(j)$. When we say “vertex i obtains the contents $\phi'$ of register $R_k$ of vertex j” we mean that at some stage of the run, i learns the fact that register $R_k$ at the vertex whose id is $\phi(j)$ contains the id $\phi'$. This can happen either by i being itself j or by its getting a message with that statement. Similarly, when saying “vertex i obtains the id of vertex j” we mean that at some stage of the run, the id $\phi(j)$ becomes available to i, i.e., either i is itself j or it gets a message containing $\phi(j)$ (as the id of a vertex, i.e., as the contents of register $R_I$ of some vertex).

Let A be a deterministic algorithm that succeeds in computing a spanning tree on every input graph and whose communication complexity is $f(n) = o(n^4)$. In particular, this means that there exists a constant $n_0$ such that for every $n \ge n_0$, the algorithm A completes the construction of a spanning tree on $G_n$ with communication cost less than $n^4$. Clearly, then, the algorithm does not send any messages over any bypassing edge in these graphs, since using such an edge immediately incurs a cost of at least $n^4$. Henceforth we restrict attention to graphs $G_n$ for $n \ge n_0$.

Our  proof  is  based  on  the  following  lemma. 

Lemma 7.1 In the execution of A on $G_n$, for every $1 \le i \le n/2$ there is some vertex j such that one of the following two events must happen.

1. j receives both the id of i, $\phi(i)$, and the id stored in $(n+1-i)$'s bypassing register $R_B$.

2. j receives both the id of $n+1-i$, $\phi(n+1-i)$, and the id stored in i's bypassing register $R_B$.

Proof: Suppose that there is some $1 \le i \le n/2$ for which the lemma does not hold. Consider the graph $G'_n$ obtained from $G_n$ by adding two vertices v, w and replacing the edge $(i, n+1-i)$ with the two edges $(i, v)$ and $(n+1-i, w)$, both with weight $X^4$. Figure 8 depicts the graph $G'_n$ for $n = 9$ and $i = 3$.

Select the id assignment $\phi(k) = 2k$ for $k \in V$, and the id's $\phi(v) = 2(n+1-i) - 1$ and $\phi(w) = 2i - 1$.

Figure 8: The graph $G'_9$.

Consider the run of A on the graph $G'_n$. We claim that the executions of A on $G_n$ and $G'_n$ are similar. This is proved by induction on the length of the runs, noting that as long as no messages are sent over the edges connecting v and w to the rest of the graph, any comparison made by any vertex has the same result in both runs, except a comparison of the id of i to the contents of the bypassing register $R_B$ of $n+1-i$, or a comparison of the id of $n+1-i$ to the contents of the bypassing register $R_B$ of i. Since no vertex holds both these values, the runs will remain similar. This implies that in $G'_n$, the vertices v and w will not join the spanning tree, contradicting the assumption that A operates correctly on every graph. ∎
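The role of the id assignment in this proof can be checked mechanically. The sketch below compares, for a given i, the value held in a bypassing register in $G_n$ with the value held there in $G'_n$, against every id except the single one named in the lemma. The function names `comparison_profile` and `ids_indistinguishable` are hypothetical and only the comparison outcomes are modeled, not the protocol.

```python
def comparison_profile(value, ids):
    """Outcome (-1, 0 or 1) of comparing `value` with each id in `ids`."""
    return [(value > x) - (value < x) for x in ids]

def ids_indistinguishable(n, i):
    """True if swapping the bypassing-register contents is invisible to comparisons."""
    phi = {k: 2 * k for k in range(1, n + 1)}   # ids of the original vertices
    phi_v = 2 * (n + 1 - i) - 1                 # id of v, the new neighbor of i
    phi_w = 2 * i - 1                           # id of w, the new neighbor of n+1-i
    # Register R_B of i holds phi(n+1-i) in G_n and phi(v) in G'_n; compare both
    # against every id other than phi(n+1-i) itself.
    others_i = [phi[k] for k in phi if k != n + 1 - i]
    same_i = comparison_profile(phi[n + 1 - i], others_i) == comparison_profile(phi_v, others_i)
    # Symmetric check for register R_B of n+1-i, which holds phi(i) or phi(w).
    others_j = [phi[k] for k in phi if k != i]
    same_j = comparison_profile(phi[i], others_j) == comparison_profile(phi_w, others_j)
    return same_i and same_j
```

The only comparison whose outcome changes is the one against the excluded id, which is exactly the pair of values that, by assumption, no vertex holds together.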

Lemma 7.2 Algorithm A requires $\Omega(n\mathcal{V})$ messages.

Proof: Consider the execution of A on $G_n$ and pick some $1 \le i \le n/2$. It follows from the previous lemma that messages containing the names i or $n+1-i$ were passed over at least $n+1-2i$ path edges during the run, each of weight X. Thus the message complexity of A is at least

\[
X \sum_{i=1}^{n/2} (n+1-2i) \ge n^2 X/4 = \Omega(n\mathcal{V}).
\]
∎
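For completeness, the sum in this bound can be evaluated explicitly. The following worked computation assumes that n is even; for odd n, with the upper limit read as $\lfloor n/2 \rfloor$, the same steps give $(n^2-1)/4$, which affects only the constant.

```latex
\[
\sum_{i=1}^{n/2} (n+1-2i)
  = \frac{n}{2}(n+1) - 2\cdot\frac{1}{2}\cdot\frac{n}{2}\Bigl(\frac{n}{2}+1\Bigr)
  = \frac{n^2}{2} + \frac{n}{2} - \frac{n^2}{4} - \frac{n}{2}
  = \frac{n^2}{4}.
\]
\[
\text{Since } \mathcal{V} = (n-1)X: \qquad
X\cdot\frac{n^2}{4} \;>\; \frac{n(n-1)X}{4} \;=\; \frac{n\mathcal{V}}{4}.
\]
```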


7.2  An  upper  bound 


Claim 7.3 Algorithm CON_hybrid (presented below) requires $O(\min\{\mathcal{E}, n\mathcal{V}\})$ messages.


We now describe Algorithm CON_hybrid, whose communication complexity is the minimum of those of the algorithms DFS and MST_centr presented above. In effect, algorithm CON_hybrid runs algorithms DFS and MST_centr in parallel. The idea is that the root vertex “controls” both algorithms and suspends the more expensive one. Towards this goal, the root maintains the variables $W_a$ and $W_b$, which are the root estimates of the communication costs of algorithms DFS and MST_centr, respectively. In addition, the root maintains the variable Permit, which takes values in {DFS, MST_centr} and is updated whenever $W_a$ or $W_b$ is updated at the root. As a rule,

\[
\mathit{Permit} = \begin{cases} \text{DFS}, & W_a \le W_b, \\ \text{MST}_{centr}, & \text{otherwise.} \end{cases}
\]

When Permit = DFS, algorithm DFS is running and algorithm MST_centr is suspended, and vice versa.

Observe that it is easy to suspend either of the two algorithms, as algorithm DFS (respectively, MST_centr) needs to be suspended only when $W_a$ (resp., $W_b$) is increased. At that moment, the center of activity of algorithm DFS (resp., MST_centr) is located at the root, so the algorithm can be suspended by requesting that the center of activity stay at the root.

Since at any time the root estimates are within a factor of two of the actual communication costs of both algorithms, and since only the algorithm with the smaller estimate is enabled at any given time, the total complexity of algorithm CON_hybrid cannot exceed the complexity of the cheaper of the two algorithms by more than a factor of four.
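The control rule can be illustrated by a small simulation. This is only a sketch under simplifying assumptions: each algorithm is modeled as a non-empty list of per-step costs, the controller sees the exact cost spent so far rather than a factor-two estimate, and the name `run_hybrid` and the labels "A" and "B" are hypothetical. Under these assumptions the total is at most twice the cheaper cost plus one step, rather than the factor of four that holds with estimates.

```python
def run_hybrid(costs_a, costs_b):
    """Interleave two algorithms, always enabling the one that has spent less.

    costs_a, costs_b: non-empty lists of per-step communication costs.
    Returns (name of the algorithm that finished, total cost spent by both).
    """
    steps = {"A": costs_a, "B": costs_b}
    spent = {"A": 0, "B": 0}
    pos = {"A": 0, "B": 0}
    while True:
        # Permit goes to the algorithm with the smaller cost so far (ties: A).
        permit = "A" if spent["A"] <= spent["B"] else "B"
        spent[permit] += steps[permit][pos[permit]]
        pos[permit] += 1
        if pos[permit] == len(steps[permit]):   # enabled algorithm has terminated
            return permit, spent["A"] + spent["B"]
```

For example, with one algorithm made of 100 unit-cost steps and another made of three steps of cost 5, the second finishes first and the combined cost stays well below the cost of running the first to completion.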


8  Fast  minimum  spanning  tree  algorithms 

This section proceeds as follows. In order to explain the more complex algorithms MST_fast and MST_hybrid, we start with a description of the algorithm of [GHS83], referred to as “algorithm MST_ghs”, and show that it has communication complexity $O(\mathcal{E} + \mathcal{V}\log n)$ and time complexity $O(\mathcal{E} + \mathcal{V}\log n)$. Then, we develop our new algorithms:

• An algorithm MST_hybrid with communication complexity $O(\min\{\mathcal{E} + \mathcal{V}\log n, n\mathcal{V}\})$.

• An algorithm MST_fast with communication complexity $O(\mathcal{E}\log\mathcal{V}\log n)$ and time complexity $O(\mathrm{Diam}(\mathrm{MST})\log\mathcal{V}\log n) = O(n\mathcal{V}\log\mathcal{V}\log n)$.


8.1 Algorithm MST_ghs

The algorithm consists of two stages, namely, a wake-up stage followed by a work stage. The wake-up stage of [GHS83] is based on flooding a “wake-up” message through the network. The work stage can be thought of as a modification of the following simple algorithm. The algorithm consists of $\log n$ phases. At each phase, we have a number of “fragments”, i.e., subtrees of the MST, that try to merge, in parallel, into larger fragments. Towards this goal, each fragment performs the following steps:

1.  The  name  of  the  fragment  is  broadcast  from  the  root  to  all  the  vertices. 

2. Each vertex scans, serially and in increasing order of weights, all edges that have not been scanned so far, until an edge outgoing from the fragment is found. The name of that edge and its weight are reported to the parent.

3. Each vertex collects reports from its children containing the name of the edge chosen by the subtree rooted at the child. Among those, the minimum weight edge is selected and propagated to the parent. The vertex also marks its edge to the child in whose subtree the selected edge is located.

4.  When  the  root  selects  its  outgoing  edge,  the  path  from  the  selected  edge  to  the  root 
has  been  marked.  Now,  the  root  of  the  tree  is  moved  along  this  path,  and  the  fragment 
is  hooked  onto  another  fragment  along  the  outgoing  edge. 

It is easy to implement the above algorithm if the phases of different fragments are synchronized. However, synchronization requires scanning all the edges. The algorithm of [GHS83] manages to accomplish its task without synchronization, thus saving communication. For more details, see [GHS83].
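The synchronized version of the fragment-merging steps can be sketched as a centralized simulation. This is not the asynchronous protocol of [GHS83]; it assumes distinct edge weights and global phases, and the name `fragment_mst` and the `(weight, u, v)` edge format are introduced here for illustration.

```python
def fragment_mst(n, edges):
    """Merge fragments phase by phase along minimum-weight outgoing edges.

    edges: list of (w, u, v) with distinct weights, vertices 0..n-1.
    Returns (tree_edges, number_of_phases).
    """
    frag = list(range(n))          # frag[v] leads to the name of v's fragment

    def find(v):
        while frag[v] != v:
            frag[v] = frag[frag[v]]
            v = frag[v]
        return v

    tree, phases = [], 0
    while len(tree) < n - 1:
        best = {}                  # fragment name -> minimum-weight outgoing edge
        for w, u, v in edges:
            fu, fv = find(u), find(v)
            if fu == fv:
                continue           # edge is internal to a fragment
            for f in (fu, fv):
                if f not in best or w < best[f][0]:
                    best[f] = (w, u, v)
        if not best:               # graph is disconnected; nothing left to merge
            break
        phases += 1
        for w, u, v in sorted(set(best.values())):
            fu, fv = find(u), find(v)
            if fu != fv:           # hook one fragment onto the other
                frag[fu] = fv
                tree.append((w, u, v))
    return tree, phases
```

Every fragment merges with at least one other fragment in each phase, so the number of fragments at least halves per phase, which is where the $\log n$ phase count comes from.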

Lemma 8.1 Algorithm MST_ghs requires $O(\mathcal{E} + \mathcal{V}\log n)$ communication and $O(\mathcal{E} + \mathcal{V}\log n)$ time.

Proof: The wake-up stage obviously takes $O(\mathcal{E})$ communication and $O(\mathcal{D})$ time. Throughout the algorithm, each non-tree edge is scanned at most twice, and each tree edge is scanned $\log n$ times. It follows that the communication complexity is $O(\mathcal{E} + \mathcal{V}\log n)$. The time complexity is naturally bounded by the communication complexity. ∎

Unfortunately, this algorithm does not exhibit any parallelism; its time complexity could be almost as high as its communication complexity. This is due to the fact that the edges scanned in a given phase may have weight much larger than that of the MST itself. Thus, edge scanning contributes a term of $\mathcal{E}$ to the time complexity. The communication over the MST contributes an additional $O(\mathrm{Diam}(\mathrm{MST})\log n) = O(\mathcal{V}\log n)$ time.

8.2 Algorithm MST_hybrid

The “hybrid” MST algorithm, called MST_hybrid, is obtained according to the following plan:

1. Modify the wake-up stage of algorithm MST_ghs so that instead of flooding, wake-up is performed via the DFS algorithm of the previous section.

2. Combine the resulting algorithm with algorithm MST_centr, as in Section 7.2.

The idea of the first step is that the protocol becomes “controlled”, i.e., the root is aware of the communication spent so far. This makes it easy to perform the next step, where we achieve the minimum of the complexities of the two algorithms.

Corollary 8.2 Algorithm MST_hybrid has communication complexity $O(\min\{\mathcal{E} + \mathcal{V}\log n, n\mathcal{V}\})$.


8.3 Algorithm MST_fast

The algorithm MST_fast is a modification of algorithm MST_ghs. The idea is to reduce the time it takes to scan very heavy edges that obviously do not belong to the MST. Also, we wish to avoid the time-consuming process of scanning the edges serially. Towards this goal, we modify the process of selecting an outgoing edge of a fragment as follows.

In  order  to  avoid  the  scanning  of  heavy  edges,  the  root  makes  a  “guess”  for  the  weight 
of  the  outgoing  edge.  Initially,  this  guess  is  1.  If  the  guess  is  too  low,  then  the  process  of 
searching  for  an  outgoing  edge  fails.  In  this  case,  the  root  doubles  its  guess  and  repeats  the 
search.  This  continues  until  the  search  succeeds. 

In order to achieve concurrency in the edge scanning process, the vertices scan in parallel all the edges whose weight is below the value guessed by the root. This guarantees that the time of the search is upper-bounded by $O(\mathrm{Diam}(\mathrm{MST}))$.
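The guess-doubling search can be sketched as follows. The sketch abstracts a fragment as the list of weights of its outgoing edges and treats an edge as scannable when its weight does not exceed the current guess; the function name `find_outgoing_edge` and this list representation are assumptions for illustration, and the distributed scanning itself is not modeled.

```python
def find_outgoing_edge(outgoing_weights):
    """Search for the lightest outgoing edge by doubling the weight guess.

    outgoing_weights: non-empty list of edge weights, each at least 1.
    Returns (weight of the edge found, number of search rounds).
    """
    if not outgoing_weights:
        raise ValueError("fragment has no outgoing edge")
    guess, rounds = 1, 0
    while True:
        rounds += 1
        # Only edges not heavier than the guess are scanned in this round.
        candidates = [w for w in outgoing_weights if w <= guess]
        if candidates:
            return min(candidates), rounds
        guess *= 2                 # guess was too low: double it and repeat
```

The number of rounds is logarithmic in the weight of the edge found, which is the source of the $\log\mathcal{V}$ factor in the complexity bounds.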

Corollary 8.3 Algorithm MST_fast requires $O(\mathcal{E}\log n\log\mathcal{V})$ communication and $O(\mathrm{Diam}(\mathrm{MST})\log\mathcal{V}\log n) = O(n\mathcal{V}\log\mathcal{V}\log n)$ time.


9  Fast  shortest  path  tree  algorithms 

9.1 Algorithm SPT_synch

Algorithm SPT_synch is the fastest SPT algorithm we know of. It is obtained by combining the synchronizer of Section 4 with the synchronous SPT algorithm.

Observe that the synchronous SPT algorithm runs in $O(\mathcal{D})$ time and has $O(\mathcal{E})$ communication complexity (again, making the assumption that a message sent on edge e requires precisely $w(e)$ time). The synchronizer adds $O(kn\log n)$ communication overhead and $O(\log_k n\log n)$ time units for each of the $\mathcal{D}$ time units of the original synchronous protocol. We thus have

Corollary 9.1 The algorithm SPT_synch requires $O(\mathcal{E} + \mathcal{D}\cdot kn\log n)$ communication and $O(\mathcal{D}\log_k n\log n)$ time.

9.2 Algorithm SPT_recur


Observe that the SPT problem on a “weighted” network $G = (V, E, w)$ can be reduced to a BFS problem on an “unweighted” network $\hat{G} = (\hat{V}, \hat{E}, \hat{w})$ in which each edge e is replaced by a path containing $w(e)$ edges and $w(e) - 1$ “dummy vertices”, and $\hat{w}(e) = 1$ for all edges $e \in \hat{E}$. Thus, we can construct an SPT of the original graph by running the BFS algorithm of [Awe89] on the “unweighted” version $\hat{G}$ of the network.
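The reduction can be made concrete with a short sketch, assuming positive integer edge weights. The helper names `subdivide` and `bfs_dist`, the `(u, v, w)` edge format and the numbering of dummy vertices are choices made here for illustration; a plain sequential BFS stands in for the distributed algorithm of [Awe89].

```python
from collections import deque

def subdivide(n, edges):
    """Replace each edge (u, v, w) by a path of w unit edges with w - 1 dummy vertices.

    Original vertices keep the names 0..n-1; dummy vertices get fresh names.
    Returns adjacency lists of the unweighted graph.
    """
    adj = {v: [] for v in range(n)}
    nxt = n
    for u, v, w in edges:
        prev = u
        for _ in range(w - 1):              # create the w - 1 dummy vertices
            adj[nxt] = [prev]
            adj[prev].append(nxt)
            prev, nxt = nxt, nxt + 1
        adj[prev].append(v)
        adj[v].append(prev)
    return adj

def bfs_dist(adj, s):
    """Hop distances from s in the unweighted graph."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        y = queue.popleft()
        for x in adj[y]:
            if x not in dist:
                dist[x] = dist[y] + 1
                queue.append(x)
    return dist
```

The BFS layer of an original vertex in the subdivided graph equals its weighted distance from the source, and the subdivided graph has as many unit edges as the total weight of the original edges.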

The BFS algorithm of [Awe89] is based on a very simple BFS algorithm, referred to in the sequel as the DIJKSTRA Algorithm, due to its resemblance to Dijkstra's shortest path algorithm (cf. [Gal82]) and the Dijkstra-Scholten distributed termination detection procedure [DS80]. This algorithm is similar to Algorithm SPT_centr of Section 6, with the only difference being that the names of vertices that join the tree do not have to be broadcast along the tree.

The algorithm maintains a tree rooted at the source vertex. Initially, the tree is empty. Upon termination of the algorithm, the tree is the desired BFS tree. Throughout the algorithm, the tree can only grow, and at any time it is a subtree of the final BFS tree. The algorithm operates in successive iterations, each adding one more BFS layer to the tree. At the beginning of a given iteration $l$, the tree contains all vertices in layers less than or equal to $l - 1$. Upon termination of iteration $l$, the tree is extended to layer $l$ as well.

The complexities of this algorithm are $O(dn + E)$ messages and $O(dD)$ time, where d is the number of layers being processed. Indeed, there are d iterations and in each of them synchronization is performed over the BFS tree, which requires $O(n)$ messages and $O(D)$ time. In addition, one exploration message is sent over each edge once in each direction. Obviously, the performance of the algorithm degrades as the number d of layers to be processed increases.
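The message count can be illustrated by a sequential simulation of the layer-by-layer growth. The sketch makes simplifying assumptions: synchronization in each iteration is charged as one broadcast and one echo over the current tree edges, replies to exploration messages are not counted, and the name `layered_bfs` is hypothetical.

```python
def layered_bfs(adj, s):
    """Grow a BFS tree one layer per iteration, counting messages.

    adj: dict mapping each vertex to a list of neighbors (unweighted graph).
    Returns (layer of each vertex, synchronization messages, exploration messages).
    """
    layer = {s: 0}
    frontier = [s]
    sync = explore = 0
    l = 0
    while frontier:
        l += 1
        sync += 2 * (len(layer) - 1)   # broadcast and echo over current tree edges
        new = []
        for y in frontier:
            for x in adj[y]:
                explore += 1           # exploration message from y to x
                if x not in layer:
                    layer[x] = l       # x joins the tree in layer l
                    new.append(x)
        frontier = new
    return layer, sync, explore
```

Each vertex is in the frontier exactly once, so the exploration count is twice the number of edges, while the synchronization count grows with the number of iterations, matching the $dn$ term.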

The  idea  behind  [Awe89]  is  to  reduce  the  problem  where  a  large  number  of  layers  needs 
to  be  processed  to  a  problem  with  a  small  number  of  layers.  If  the  network  has  diameter  D, 
we  can  conceptually  “slice”  the  network  into  “strips”  of  length  d,  and  process  those  strips 
sequentially  (see  Figure  9). 

In [Awe89], an efficient reduction strategy is proposed. It is proved in [Awe89] that, in the unweighted case, $O(D^{\epsilon})$ time units are spent on each BFS layer, and $O(E^{\epsilon})$ messages traverse each network edge. In our (weighted) case, we have $\mathcal{D}$ layers and $\mathcal{E}$ edges in the “unweighted” version of our network. Thus, we have

Corollary 9.2 The weighted communication and time complexities of the resulting protocol SPT_recur are $O(\mathcal{E}^{1+\epsilon})$ and $O(\mathcal{D}^{1+\epsilon})$, respectively.

Figure 9: Strip Method.


9.3 The hybrid SPT algorithm SPT_hybrid

As before, it is possible to combine the two SPT algorithms SPT_synch and SPT_recur and obtain an algorithm SPT_hybrid that is as efficient in communication as either one of the two. This is done in a manner similar to the design of the hybrid MST algorithm in Section 8.2.

Corollary 9.3 The weighted communication and time complexities of the resulting protocol SPT_hybrid are given by $O(\min\{\mathcal{E}^{1+\epsilon}, \mathcal{E} + n\mathcal{V}\})$.